hippynn

Full documentation for the hippynn package.

The hippynn python package.

settings

Values for the current hippynn settings. See Library Settings for a description.

class Database(arr_dict: dict[str, numpy.ndarray], inputs: list[str], targets: list[str], seed: int | numpy.random.RandomState | tuple, test_size: float | int = None, valid_size: float | int = None, num_workers: int = 0, pin_memory: bool = True, allow_unfound: bool = False, auto_split: bool = False, device: torch.device = None, dataloader_kwargs: dict[str, object] = None, quiet=False)[source]

Bases: object

Class for holding a pytorch dataset, splitting it, generating dataloaders, etc.
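
usage (an illustrative sketch; the array names "species", "positions", and "energy" and their shapes are assumptions, not requirements of the class):

>>> import numpy as np
>>> arr_dict = {
...     "species": np.zeros((100, 20), dtype=np.int64),  # per-atom species codes, padded to 20 atoms
...     "positions": np.zeros((100, 20, 3)),              # per-atom coordinates
...     "energy": np.zeros((100, 1)),                     # per-system target
... }
>>> database = Database(arr_dict, inputs=["species", "positions"], targets=["energy"],
...                     seed=0, test_size=0.1, valid_size=0.1)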

add_split_masks(dict_to_add_to=None, split_prefix=None)[source]

Add split masks to the dataset. This function is used internally before writing databases.

When using the dict_to_add_to parameter, this function writes numpy arrays. When adding to self.splits, this function writes tensors.

Parameters:
  • dict_to_add_to – where to put the split masks. Defaults to self.splits.

  • split_prefix – prefix for mask names

Returns:

get_device() device[source]

Determine what device the database resides on. Raises ValueError if multiple devices are encountered.

Returns:

device.

make_automatic_splits(split_prefix=None, dry_run=False)[source]

Split the database automatically, based on existing split-mask arrays. Because the user explicitly requests this routine, it fails strictly when the masks are missing or inconsistent.

Parameters:
  • split_prefix – if None, use the default prefix. Otherwise, use this prefix to determine which arrays are masks.

  • dry_run – Only validate that existing split masks are correct; don’t perform splitting.

Returns:

make_database_cache(file: str = './hippynn_db_cache.npz', overwrite: bool = False, **override_kwargs) Database[source]

Cache the database as-is, and re-open it.

Useful for creating an easy restart script if the storage space is available. The new database will by default inherit the properties of this database.

usage: >>> database = database.make_database_cache()

Parameters:
  • file – where to store the database

  • overwrite – whether to overwrite an existing cache file with this name.

  • override_kwargs – passed to NPZDictionary instead of the current database settings.

Returns:

The new database created from the cache.

make_explicit_split(split_name: str, split_indices: ndarray)[source]
Parameters:
  • split_name – name for split, typically ‘train’, ‘valid’, ‘test’

  • split_indices – the indices of the items for the split

Returns:

make_explicit_split_bool(split_name: str, split_mask: ndarray | tensor)[source]
Parameters:
  • split_name – name for split, typically ‘train’, ‘valid’, ‘test’

  • split_mask – a boolean array for where to split

Returns:

make_generator(split_name: str, evaluation_mode: str, batch_size: int | None = None, subsample: float | bool = False)[source]

Makes a dataloader for the given type of split and evaluation mode of the model.

In most cases, you do not need to call this function directly as a user.

Parameters:
  • split_name – str; “train”, “valid”, or “test” ; selects data to use

  • evaluation_mode – str; “train” or “eval”. Used for whether to shuffle.

  • batch_size – passed to pytorch

  • subsample – fraction to subsample

Returns:

dataloader containing relevant data
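
usage (a sketch; most workflows rely on train_model() or Predictor rather than calling this directly):

>>> loader = database.make_generator("train", "train", batch_size=64)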

make_random_split(split_name: str, split_size: int | float)[source]

Make a random split using self.random_state to select items.

Parameters:
  • split_name – String naming the split, can be anything, but ‘train’, ‘valid’, and ‘test’ are special.

  • split_size – int (number of items) or float < 1 (fraction of samples).

Returns:

make_trainvalidtest_split(*, test_size: int | float, valid_size: int | float)[source]

Make a split for train, valid, and test out of any remaining unsplit entries in the database. The size is specified in terms of test and valid splits; the train split will be the remainder.

If you wish to specify precise rows for each split, see make_explicit_split or make_explicit_split_bool.

This function takes keyword-arguments only in order to prevent confusion over which size is which.

The types of both test_size and valid_size parameters must match.

Parameters:
  • test_size – int (count) or float (fraction) of data to assign to test split

  • valid_size – int (count) or float (fraction) of data to assign to valid split

Returns:

None
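
usage (a sketch of two alternative splitting workflows; the index array and fractions are illustrative):

>>> # one call for all three splits
>>> database.make_trainvalidtest_split(test_size=0.1, valid_size=0.1)

>>> # or: build splits piece by piece
>>> import numpy as np
>>> database.make_explicit_split("test", np.arange(100))   # reserve the first 100 items for testing
>>> database.make_random_split("valid", split_size=0.1)
>>> database.split_the_rest("train")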

remove_high_property(key: str, atomwise: bool, norm_per_atom: bool = False, species_key: str = None, cut: float | None = None, std_factor: float | None = 10, norm_axis: int | None = None)[source]

For removing outliers from a dataset. Use with caution; do not inadvertently remove outliers from benchmarks!

The parameters cut and std_factor can be set to None to skip their respective steps. The per_atom and atom_var options are mutually exclusive; they cannot both be true.

Parameters:
  • key – The property key in the dataset to check for high values

  • atomwise – True if the property is defined per atom in axis 1, otherwise property is treated as whole-system value

  • norm_per_atom – True if the property should be normalized by atom counts

  • species_key – Which array represents the atom presence; required if per_atom is True

  • cut – If values > mu + cut, the system is removed. This step is done first.

  • std_factor – If (value - mu)/std > std_factor, the system is trimmed. This step is done second.

  • norm_axis – if not None, the property array is normed on the axis. Useful for vector properties like force.

Returns:
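
usage (a sketch; the array name "energy" is an assumption):

>>> database.remove_high_property("energy", atomwise=False, std_factor=10)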

send_to_device(device: device = None)[source]

Move the database to an accelerator device if possible. In some circumstances this can accelerate training.

Note

If the database is moved to a GPU, pin_memory will be set to False and num_workers will be set to 0.

Parameters:

device – device to move to, if None, try to auto-detect.

Returns:

sort_by_index(index_name: str = 'indices')[source]

Sort arrays in each split of the database by an index key.

The default is ‘indices’; ‘split_indices’ or any other variable name in the database may also be used.

Parameters:

index_name – name of the array to sort by.

Returns:

None

split_the_rest(split_name: str)[source]
trim_by_species(species_key: str, keep_splits_same_size: bool = True)[source]

Remove any excess padding in a database.

Parameters:
  • species_key – what array to use to mark atom presence.

  • keep_splits_same_size – true: trim by the minimum amount across splits, false: trim by the maximum amount for each split.

Returns:

None

write_h5(split: str | None = None, h5path: str | None = None, species_key: str = 'species', overwrite: bool = False)[source]

Write this database to the pyanitools h5 format. See hippynn.databases.h5_pyanitools.write_h5() for details.

Note: This function will error if h5py is not installed.

Parameters:
  • split

  • h5path

  • species_key

  • overwrite

Returns:

write_npz(file: str, record_split_masks: bool = True, compressed: bool = True, overwrite: bool = False, split_prefix: str | None = None, return_only: bool = False)[source]
Parameters:
  • file – str, Path, or file object compatible with np.save

  • record_split_masks – whether to generate and place masks for the splits into the saved database.

  • compressed – whether to use np.savez_compressed (True) or np.savez

  • overwrite – whether to overwrite an existing file at this path. Only used if file is a str or Path.

  • split_prefix – optionally override the prefix for the masks computed by the splits.

  • return_only – if True, ignore the file string and just return the resulting dictionary of numpy arrays.

Returns:
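
usage (a sketch; the filename is illustrative):

>>> database.write_npz("my_dataset.npz", record_split_masks=True, overwrite=True)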

property var_list
class DirectoryDatabase(directory, name, inputs, targets, *args, quiet=False, allow_unfound=False, **kwargs)[source]

Bases: Database, Restartable

Database stored as NPY files in a directory.

Parameters:
  • directory – directory path where the files are stored

  • name – prefix for the arrays.

This function loads arrays of the format f”{name}{db_name}.npy” for each variable db_name in inputs and targets.

Other arguments: See Database.

Note

This database loader does not support the allow_unfound setting in the base Database. The variables to load must be set explicitly in the inputs and targets.
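
usage (a sketch; per the file format above, this assumes files such as ./data/dataset-species.npy exist):

>>> database = DirectoryDatabase(directory="./data/", name="dataset-",
...                              inputs=["species", "positions"], targets=["energy"],
...                              seed=0, test_size=0.1, valid_size=0.1)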

get_file_dict(directory, prefix)[source]
load_arrays(directory, name, inputs, targets, quiet=False, allow_unfound=False)[source]
class GraphModule(required_inputs, nodes_to_compute)[source]

Bases: Module

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

forward(*input_values)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_module(node)[source]
node_from_name(name)[source]
print_structure(suppress=True)[source]

Pretty-print the structure of the nodes and links comprising this graph.

class IdxType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Atoms = 'Atoms'
MolAtom = 'MolAtom'
MolAtomAtom = 'MolAtomAtom'
Molecules = 'Molecules'
NotFound = 'NOT FOUND'
Pair = 'Pair'
QuadMol = 'QuadMol'
QuadPack = 'QuadPack'
Scalar = 'Scalar'
class NPZDatabase(file, inputs, targets, *args, allow_unfound=False, quiet=False, **kwargs)[source]

Bases: Database, Restartable

load_arrays(file, inputs, targets, allow_unfound=False, quiet=False)[source]
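
usage (a sketch; the filename and array names are assumptions):

>>> database = NPZDatabase("my_dataset.npz", inputs=["species", "positions"], targets=["energy"],
...                        seed=0, test_size=0.1, valid_size=0.1)
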
class Predictor(inputs, outputs, return_device=device(type='cpu'), model_device=None, requires_grad=False, name=None)[source]

Bases: object

The predictor is a dressed-up GraphModule which gives access to the outputs of individual nodes.

In many cases you may simply want to use the from_graph method to generate a predictor.

The predictor will take the model graph, convert the output nodes into a padded index state, and build a new graph for these operations.

add_output(node)[source]
apply_to_database(db, **kwargs)[source]

Note: kwargs are passed to self.__call__, e.g. the batch_size parameter.

classmethod from_graph(graph, additional_outputs=None, **kwargs)[source]

Construct a new predictor from an existing GraphModule.

Parameters:
  • graph – graph to create predictor for. The predictor makes a shallow copy of this graph. e.g. it may move parameters from that graph to the model_device.

  • additional_outputs – List of additional nodes to include in outputs

  • kwargs – passed to __init__

Returns:

predictor instance
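
usage (a sketch, assuming model is an existing GraphModule and database is a loaded Database):

>>> predictor = Predictor.from_graph(model)
>>> results = predictor.apply_to_database(database, batch_size=64)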

predict_all(node_values)[source]
predict_batched(node_values, batch_size)[source]
to(*args, **kwargs)[source]
wrap_outputs(out_dict)[source]
property inputs
property model_device
property outputs
active_directory(dirname, create=None)[source]

Context manager for temporarily switching the current working directory.

If create is None, always succeed. If create is True, only succeed if the directory does not exist, and create one. If create is False, only succeed if the directory does exist, and switch to it.

In other words, use create=True if you want to force that it’s a new directory. Use create=False if you want to switch to an existing directory. Use create=None if you are okay with either alternative; the directory will be created if it does not already exist.

Parameters:
  • dirname – directory to enter

  • create – (None,True,False)

Returns:

None

Raises:

If the directory status is not compatible with the create constraint.
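
usage (a sketch; the directory name is illustrative):

>>> with active_directory("my_experiment", create=True):
...     ...  # files written here land in the new directory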

hierarchical_energy_initialization(energy_module, database=None, trainable_after=False, decay_factor=0.01, encoder=None, energy_name=None, species_name=None, peratom=False)[source]

Computes values for the non-interacting energy using the training data.

Parameters:
  • energy_module – HEnergyNode or torch module for energy prediction

  • database – InterfaceDB object to get training data, required if model contains E0 term

  • trainable_after – Determines if it should change .requires_grad attribute for the E0 parameters

  • decay_factor – change initialized weights of further energy layers by df**N for layer N

  • encoder – species encoder, can be auto-identified from energy node

  • energy_name – name for the energy variable, can be auto-identified from energy node

  • species_name – name for the species variable, can be auto-identified from energy node

  • peratom

Returns:

None
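
usage (a sketch, assuming henergy is the energy node of your graph and database holds the training data):

>>> hierarchical_energy_initialization(henergy, database=database, trainable_after=False)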

load_checkpoint(structure_fname: str, state_fname: str, restart_db=False, map_location=None, model_device=None, **kwargs) dict[source]

Load checkpoint file from given filename.

For more information on how to use this function, see Restarting training.

Parameters:
  • structure_fname – name of the structure file

  • state_fname – name of the state file

  • restart_db – restore database or not, defaults to False

  • map_location – device mapping argument for torch.load, defaults to None

  • model_device – automatically handle device mapping. Defaults to None.

Returns:

experiment structure

load_checkpoint_from_cwd(map_location=None, model_device=None, **kwargs) dict[source]

Same as load_checkpoint, but using default filenames.

Parameters:
  • map_location (Union[str, dict, torch.device, Callable], optional) – device mapping argument for torch.load, defaults to None

  • model_device (Union[int, str, torch.device], optional) – automatically handle device mapping. Defaults to None.

Returns:

experiment structure

Return type:

dict

load_model_from_cwd(map_location=None, model_device=None, **kwargs) GraphModule[source]

Load only the model from the current working directory.

Parameters:
  • map_location (Union[str, dict, torch.device, Callable], optional) – device mapping argument for torch.load, defaults to None

  • model_device (Union[int, str, torch.device], optional) – automatically handle device mapping. Defaults to None.

Returns:

model with reloaded parameters
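
usage (a sketch; run from the directory in which the experiment was saved):

>>> # full restart information, including the experiment structure
>>> structure = load_checkpoint_from_cwd(model_device="cpu")
>>> # or: just the model, e.g. for inference
>>> model = load_model_from_cwd(model_device="cpu")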

log_terminal(file, *args, **kwargs)[source]
Parameters:
  • file – filename string or an already-open file object

  • args – piped to open(file, *args, **kwargs) if file is a string

  • kwargs – piped to open(file, *args, **kwargs) if file is a string

Context manager where stdout and stderr are redirected to the specified file in addition to the usual stdout and stderr. The manager yields the file. Writes made directly to the yielded file object (i.e. with “with log_terminal(…) as <file>”) will not automatically be piped to the terminal.
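
usage (a sketch; the filename and mode are illustrative and are passed through to open()):

>>> with log_terminal("training_log.txt", "wt"):
...     print("this line goes to the terminal and to training_log.txt")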

reload_settings(**kwargs)[source]

Attempt to reload the hippynn library settings.

Settings sources are, in order from least to greatest priority:
  • Default values

  • The file ~/.hippynnrc, a standard python config file containing variables under the section name [GLOBALS].

  • A file specified by the environment variable HIPPYNN_LOCAL_RC_FILE, treated the same as the user rc file.

  • Environment variables prefixed by HIPPYNN_, e.g. HIPPYNN_DEFAULT_PLOT_FILETYPE.

  • Keyword arguments passed to this function.

Parameters:

kwargs – explicit settings to change.

Returns:
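
usage (a sketch; the setting name is inferred from the HIPPYNN_DEFAULT_PLOT_FILETYPE example above, and the value is an assumption):

>>> reload_settings(DEFAULT_PLOT_FILETYPE=".png")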

set_custom_kernels(active: bool | str = True) str[source]

Activate or deactivate custom kernels for interaction.

This function changes the library's global custom-kernel state.

Special non-implementation-name values for active are:
  • True – use the best GPU kernel from the recommended implementations; error if none are available.

  • False – equivalent to “pytorch”.

  • “auto” – equivalent to True if a recommended implementation is available, else equivalent to “pytorch”.

Parameters:

active – implementation name to activate

Returns:

active, actual implementation selected.
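
usage (a sketch):

>>> implementation = set_custom_kernels("auto")  # falls back to "pytorch" if no recommended kernel is available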

setup_and_train(training_modules: TrainingModules, database: Database, setup_params: SetupParams, store_all_better=False, store_best=True, store_every=0)[source]
Parameters:
  • training_modules – see setup_training()

  • database – see train_model()

  • setup_params – see setup_training()

  • store_all_better – save the state dict for each model that does better than a previous one

  • store_best – save a checkpoint for the best model

  • store_every – save a checkpoint every this many epochs

Returns:

See train_model()

Shortcut for setup_training followed by train_model.

Note

The training loop will capture KeyboardInterrupt exceptions to abort the experiment early. If you would like to gracefully kill training programmatically, see train_model() with callbacks argument.

Note

Files are saved in the current working directory; we recommend switching to a fresh directory with a descriptive name for your experiment.
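
usage (a sketch, assuming training_modules, database, and setup_params have already been constructed; see setup_training() and SetupParams):

>>> metric_tracker = setup_and_train(training_modules=training_modules,
...                                  database=database,
...                                  setup_params=setup_params)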

setup_training(training_modules: TrainingModules, setup_params: SetupParams)[source]

Prepares training_modules for training according to setup_params.

Parameters:
  • training_modules – tuple of model, training loss, and evaluation losses (can be built from a graph using graphs.assemble_training_modules)

  • setup_params – parameters controlling how training is performed (see SetupParams)

Roughly:

  • sets devices for training modules

  • if no controller given:

    • instantiates and links optimizer to the learnable params on the model

    • instantiates and links scheduler to optimizer

    • builds a default controller with setup params

  • creates a MetricTracker for storing the training metrics

Returns:

(optimizer,evaluator,controller,metrics,callbacks)

test_model(database, evaluator, batch_size, when, metric_tracker=None)[source]

Tests the model on the database according to the model_evaluator metrics. If a plot_maker is attached to the model evaluator, it will make plots. The plots go in a sub-folder named according to the when argument, which describes when the testing takes place. The results are then printed.

Parameters:
  • database – The database to test the model on.

  • evaluator – The evaluator containing model and evaluation losses to measure.

  • when – A string describing when the testing takes place; used to label the plots and their sub-folder.

  • metric_tracker – (Optional) metric tracker to save metrics on. If not provided, a blank one will be constructed.

Returns:

metric tracker

train_model(training_modules, database, controller, metric_tracker, callbacks, batch_callbacks, store_all_better=False, store_best=True, store_every=0, store_structure_file=True, store_metrics=True, quiet=False)[source]

Performs the training loop and allows keyboard interrupt. When done, reinstates the best model, makes plots of metrics over time, and tests the model.

Parameters:
  • training_modules – tuple-like of model, loss, and evaluator

  • database – Database

  • controller – Controller

  • metric_tracker – MetricTracker for storing model performance

  • callbacks – callbacks to perform after every epoch.

  • batch_callbacks – callbacks to perform after every batch

  • store_best – Save a checkpoint for the best model

  • store_all_better – Save the state dict for each model that does better than a previous one

  • store_every – Save a checkpoint every this many epochs

  • store_structure_file – Save the structure file for this experiment

  • store_metrics – Save the metric tracker for this experiment.

  • quiet – If True, disable printing during training (still prints testing results).

Returns:

metric_tracker

Note

callbacks take the form of an iterable of callables and will be called with cb(epoch,new_best)

  • epoch indicates the epoch number

  • new_best indicates if the model is a new best model

Note

batch_callbacks take the form of an iterable of callables and will each be called with cb(batch_inputs, batch_model_outputs, batch_targets)

Note

You may want your callbacks to store other state; if so, an easy way is to make them callable objects.

Note

Callback state is not managed by hippynn. If you wish to save or load callback state, you will have to manage that manually (possibly with a callback itself).
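
As a sketch of the callable-object approach suggested above (all names here are illustrative):

>>> class BestEpochRecorder:
...     def __init__(self):
...         self.best_epochs = []          # state lives on the callback object
...     def __call__(self, epoch, new_best):
...         if new_best:
...             self.best_epochs.append(epoch)
>>> recorder = BestEpochRecorder()
>>> # pass callbacks=[recorder] to train_model(...) to record which epochs produced a new best model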

Subpackages

Submodules