Abstract

Data-driven approaches have been applied to reduce the cost of accurate computational studies of materials: only a small number of expensive reference electronic structure calculations are performed for a representative subset of the materials space, and these are used to train surrogate models that inexpensively predict the outcome of such calculations across the extensive space of configurations the subset represents. The way material structures are processed into a numerical description serving as input to machine learning algorithms is crucial for obtaining efficient models, and has advanced significantly in the last decade, yielding enhancements in the embedding of geometric and chemical information. Despite rapid progress in offloading calculations to dedicated hardware, these enhancements substantially increase the cost of the numerical description, which remains a decisive factor in simulations. It is therefore vital to delve deeper into the design space of representations to understand what type of information the numerical descriptions encapsulate. Insights from such analyses aid in making more informed decisions about the trade-off between accuracy and performance. While a substantial amount of work has compared representations with respect to their structure-property relationships, the inherent nature and information capacity of these representations remain largely unexplored. This thesis introduces a set of measures that enable a quantitative analysis of the relationships between features, thereby assisting in such decision-making and providing valuable insights to the community. We demonstrate how these measures can be applied to analyze representations built in terms of many-body correlations of atomic densities.
For this form of featurization, we investigate the impact of different choices of functional form and basis functions, as well as the induced feature space determined by the similarity measure and metric space. We then apply these measures to featurizations with basis functions optimized for the dataset, demonstrating their higher information capacity compared to unoptimized ones. We show how well-established optimization methods based on the covariance or correlation matrix, such as principal component analysis, can be applied in a manner that preserves symmetries. The scheme uses splines to bypass the optimization at prediction time, permitting the adoption of more expansive optimization methods in the future. Complementing these efforts is the integration of the developed methods into well-maintained and thoroughly documented packages, facilitating further advancements and incorporation into new workflows. As a showcase of this development, we present a framework for running metadynamics simulations that incorporates a machine-learning interatomic potential into the molecular dynamics engine LAMMPS, exploiting its message-passing-interface implementation of domain decomposition. This enabled us to study finite-size effects in the paraelectric-ferroelectric phase transition of barium titanate. Building on this software development, we present a way forward toward a more modular software ecosystem for the flexible construction of data-driven interatomic potentials with immediate deployment in simulations.

Details