Input File Specification (input.toml)
The workflow is fully controlled through a single TOML configuration file.
Each stage reads only the parameters relevant to it.
Execution Model
Each stage defines both:
what it does (physics / workflow logic)
how it runs (CPU / GPU / parallelization)
There is no global hardware section. Instead:
[scan] controls screening
[relax] controls structural relaxation
[bandgap] controls band gap prediction
Each stage independently defines:
device (cpu or cuda)
gpu_id (for GPU execution)
This design gives full flexibility and avoids cross-stage conflicts.
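As a minimal sketch of this per-stage layout, the fragment below runs the scan on CPU, the relaxation on one GPU, and the band gap prediction on a second GPU. The keys are the ones documented in the stage sections; the values (worker count, GPU indices) are illustrative assumptions, not recommended settings.

```toml
# Illustrative values only: each stage carries its own hardware keys.
[scan]
device = "cpu"
n_workers = 8      # assumed worker count

[relax]
device = "cuda"
gpu_id = 0

[bandgap]
device = "cuda"
gpu_id = 1         # assumes a second GPU is present
```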
Execution Behavior
| Stage | Device | Strategy |
|---|---|---|
| scan | CPU | multiprocessing |
| relax | CPU / GPU | workers or single GPU |
| bandgap | CPU | multiprocessing |
| bandgap | GPU | batched inference |
Each stage thus controls its own parallelization strategy while using the same hardware keys (device, gpu_id).
All paths are interpreted relative to the directory containing input.toml.
The following sections are supported:
[references]
[structure]
[doping]
[generate]
[scan]
[relax]
[filter]
[bandgap]
[formation]
[database] (optional)
[surface]
Not all sections are required for every stage. Each stage reads only what it needs.
[references]
Step 00 — Reference construction and relaxation.
This stage prepares all thermodynamic reference structures and writes:
reference_structures/reference_energies.json
It performs:
Relaxation of the host oxide unit cell
Construction and relaxation of the host supercell
Relaxation of metal reference phases (metal mode)
Relaxation of oxide reference phases (oxide mode)
Relaxation of O₂ gas (oxide mode)
Storage of all relaxed POSCAR files for reuse
Common Parameters
reference_mode (string)
Choose reference scheme:
"metal"
"oxide"
skip_if_done (boolean)
Skip reconstruction if JSON cache exists.
fmax (float)
Force convergence criterion used in relaxation (eV/Å).
supercell (array of 3 integers)
Supercell used later for doping. The host supercell is constructed and relaxed at this stage.
host (string)
Chemical formula of the host oxide (e.g. "SnO2").
host_dir (string)
Directory containing <host>.POSCAR.
Metal Reference Mode
Used when reference_mode = "metal".
metal_ref (array of strings)
List of metal element symbols used as reference phases.
metals_dir (string)
Directory containing <Element>.POSCAR files.
Example:
reference_structures/metals/Sn.POSCAR
reference_structures/metals/Sb.POSCAR
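A minimal sketch of a metal-mode [references] table is shown below. The host and metals_dir follow the examples above; host_dir, fmax, and supercell are hypothetical values chosen only for illustration.

```toml
[references]
reference_mode = "metal"
skip_if_done = true
fmax = 0.05                                 # eV/Å, illustrative
supercell = [2, 2, 2]                       # illustrative
host = "SnO2"
host_dir = "reference_structures/host"      # hypothetical path; must contain SnO2.POSCAR
metal_ref = ["Sn", "Sb"]
metals_dir = "reference_structures/metals"  # contains Sn.POSCAR and Sb.POSCAR
```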
Oxide Reference Mode
Used when reference_mode = "oxide".
oxides_ref (array of strings)
List of dopant oxide formulas (e.g. "Sb2O5").
oxides_dir (string)
Directory containing oxide POSCAR files.
gas_ref (string)
Gas reference formula (typically "O2").
gas_dir (string)
Directory containing gas POSCAR file.
oxygen_mode (string)
Currently supports:
"O-rich"
"O-poor"
muO_shift_ev (float)
Optional chemical potential shift applied to oxygen (eV).
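For comparison, an oxide-mode table might look like the sketch below. The formulas come from the examples above; all directory names are hypothetical, and the common parameters (fmax, supercell, skip_if_done) are omitted for brevity.

```toml
[references]
reference_mode = "oxide"
host = "SnO2"
host_dir = "reference_structures/host"      # hypothetical path
oxides_ref = ["Sb2O5"]
oxides_dir = "reference_structures/oxides"  # hypothetical path
gas_ref = "O2"
gas_dir = "reference_structures/gas"        # hypothetical path
oxygen_mode = "O-rich"
muO_shift_ev = 0.0                          # no additional shift applied to oxygen
```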
[structure]
Defines workflow I/O only.
outdir (string, default: "random_structures")
Directory where generated composition folders are written.
This directory becomes the root for:
Step 01 outputs
Step 02–06 per-composition subfolders
[doping]
Defines substitutional doping behavior.
host_species (string, required)
Element symbol of the host species to be substituted.
mode (string)
Defines doping mode:
"explicit": user provides exact compositions.
"enumerate": workflow constructs compositions combinatorially.
Explicit Mode
compositions (array of inline tables)
List of inline tables, each mapping a dopant element to its percentage:
compositions = [
{ Sb = 5 },
{ Sb = 5, Zr = 5 }
]
Percentages are defined relative to host sites.
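Put together, an explicit-mode setup could look like the following sketch. The host species "Sn" is an assumption that matches a SnO2 host. As a rough reading of the percentages, a supercell with 20 host sites and Sb = 5 would correspond to one substituted site; how non-integer counts are rounded is not specified here.

```toml
[structure]
outdir = "random_structures"

[doping]
host_species = "Sn"        # assumed host cation for a SnO2 host
mode = "explicit"
compositions = [
  { Sb = 5 },              # 5% of host sites replaced by Sb
  { Sb = 5, Zr = 5 },      # 5% Sb and 5% Zr
]
```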
Enumerate Mode
dopants (array of strings)
Allowed dopant elements.
must_include (array of strings)
Dopants that must appear in each composition.
max_dopants_total (integer)
Maximum number of distinct dopants per structure.
allowed_totals (array of floats)
Allowed total dopant percentages.
levels (array of floats)
Discrete concentration values per dopant.
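The sketch below shows how the enumerate-mode keys might be combined. All values are illustrative, and the resulting composition list depends on how the workflow combines the constraints, which is assumed here rather than taken from the implementation.

```toml
[doping]
host_species = "Sn"              # assumed host cation
mode = "enumerate"
dopants = ["Sb", "Zr", "Ti"]
must_include = ["Sb"]            # every composition contains Sb
max_dopants_total = 2            # at most two distinct dopants
allowed_totals = [5.0, 10.0]     # total dopant percentage must be 5 or 10
levels = [2.5, 5.0]              # each dopant is present at 2.5 or 5 percent
```

If the constraints are applied jointly, this would admit five compositions: Sb at 5, and Sb with Zr or Ti at either 2.5 + 2.5 or 5 + 5.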
[generate]
Step 01 — Structure generation.
This step generates one doped structure per composition by substituting
host atoms inside the relaxed host supercell produced by refs-build.
Important
refs-build must be executed first.
The relaxed host supercell is loaded from:
reference_structures/reference_energies.json
No supercell is constructed in this step.
Parameters
seed_base (integer)
Base seed used for deterministic random substitution.
Each composition generates a stable hash-based seed.
poscar_order (array of strings, optional)
Defines element ordering in written POSCAR files.
If empty, pymatgen default ordering is used.
Example:
poscar_order = ["Zr", "Ti", "Sb", "Sn", "O"]
clean_outdir (boolean, default: true)
If true, existing output directory is deleted before writing new structures.
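A minimal [generate] table, with an illustrative seed value, might read:

```toml
[generate]
seed_base = 42                                  # illustrative seed
poscar_order = ["Zr", "Ti", "Sb", "Sn", "O"]    # leave as [] for pymatgen default ordering
clean_outdir = true                             # deletes the existing output directory first
```

Setting clean_outdir = false is the safer choice when results from later steps already exist under the output directory.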
Output
For each composition:
<outdir>/<composition_tag>/
    POSCAR
    metadata.json
[scan]
Step 02 — Dopant configuration prescreening using machine-learning interatomic potentials.
For each generated structure folder inside [structure].outdir:
Generates doped configurations on the cation sublattice
Identifies symmetry-unique configurations
Evaluates single-point energies using the selected ML backend
Ranks configurations by energy
Keeps the topk lowest-energy candidates
Writes candidate folders and ranking files
Depending on the scan mode, configurations are obtained either by:
exact symmetry-unique enumeration
random symmetry-unique sampling
This step operates only on subfolders created in Step 01.
Parameters
backend (string, default: "m3gnet")
Selects the ML model used for energy prediction.
Available options:
"m3gnet": TensorFlow-based universal interatomic potential
"uma": FAIR-Chem universal model (requires Hugging Face access)
"mace": MACE foundation models (Materials Project / OMAT / MPA)
"grace": GRACE graph neural network models (if installed)
Each backend has its own supported models and execution characteristics.
model (string, default: "default")
Specifies the pretrained model variant used by the selected backend.
Behavior depends on backend:
m3gnet:
- "default" → loads the standard M3GNet universal model
uma:
- "uma-s-1p2"
- "uma-s-1p1"
- "uma-m-1p1"
mace:
- "small"
- "medium"
- "large"
- "small-mpa-0"
- "medium-mpa-0"
- "large-mpa-0"
- "small-omat-0"
- "medium-omat-0"
grace:
- GRACE-1L-OMAT
- GRACE-1L-OMAT-M-base
- GRACE-1L-OMAT-M
- GRACE-1L-OMAT-L-base
- GRACE-1L-OMAT-L
- GRACE-2L-OMAT
- GRACE-2L-OMAT-M-base
- GRACE-2L-OMAT-M
- GRACE-2L-OMAT-L-base
- GRACE-2L-OMAT-L
- GRACE-1L-OAM
- GRACE-1L-OAM-M
- GRACE-1L-OAM-L
- GRACE-2L-OAM
- GRACE-2L-OAM-M
- GRACE-2L-OAM-L
- GRACE-1L-SMAX-L
- GRACE-1L-SMAX-OMAT-L
- GRACE-2L-SMAX-M
- GRACE-2L-SMAX-L
- GRACE-2L-SMAX-OMAT-M
- GRACE-2L-SMAX-OMAT-L
task (string, default: "")
Optional task specification (used only for certain backends).
uma requires a task:
- "omat"
- "oc20"
- "oc22"
- "oc25"
- "omol"
- "odac"
- "omc"
m3gnet, mace, grace:
- Not used → keep empty ("")
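As an illustration of how backend, model, and task fit together, the fragment below selects a uma model for the scan. The choice of model and task is arbitrary; the commented lines sketch the equivalent selection for mace.

```toml
[scan]
backend = "uma"
model = "uma-s-1p1"
task = "omat"          # uma requires a task

# Alternative without a task:
# backend = "mace"
# model = "medium-mpa-0"
# task = ""
```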
poscar_in (string, default: "POSCAR")
Filename inside each composition folder used as input.
topk (integer)
Number of lowest-energy configurations retained.
symprec (float)
Tolerance used for symmetry detection in SpacegroupAnalyzer.
mode (string, default: "auto")
Scan strategy.
Possible values:
auto: Automatically chooses exact enumeration for manageable problems and switches to sampling when the configuration space becomes too large.
exact: Forces full symmetry-unique enumeration of all configurations.
sample: Uses random symmetry-unique sampling instead of full enumeration.
max_enum (integer)
Maximum allowed number of raw combinatorial configurations in exact mode.
If this limit is exceeded and mode = "auto", the scan automatically switches
to sampling mode.
max_unique (integer)
Maximum allowed number of symmetry-unique configurations in exact mode.
Prevents excessive memory usage for very large configuration spaces.
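To judge when max_enum is likely to be exceeded, the raw count can be estimated before any symmetry reduction. The sketch below assumes that "raw combinatorial configurations" means the number of distinct ways to place the dopant atoms on the cation sites; the symbols \(N\) (cation sites) and \(k_i\) (atoms of dopant \(i\)) and the example numbers are illustrative and not taken from the workflow.

```latex
% Raw count for dopants i = 1..m placed on N cation sites
N_{\mathrm{raw}} = \frac{N!}{k_1!\,k_2!\cdots k_m!\,\bigl(N - \sum_i k_i\bigr)!}

% One dopant, N = 32, k_1 = 2
N_{\mathrm{raw}} = \binom{32}{2} = 496

% Two dopants, N = 32, k_1 = k_2 = 2
N_{\mathrm{raw}} = \binom{32}{2}\binom{30}{2} = 496 \times 435 = 215\,760
```

The count grows quickly with each additional dopant, which is the situation in which mode = "auto" falls back to sampling.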
device (string, default: "cpu")
Execution device:
"cpu"
"cuda"
Behavior depends on backend:
m3gnet: GPU mode uses a single worker (TensorFlow limitation)
uma: GPU supported via PyTorch
mace: GPU supported and recommended for performance
grace: GPU support depends on model implementation
n_workers (integer)
Number of parallel worker processes.
Used only when device = "cpu".
Some backends may internally limit parallelism:
m3gnet (GPU): forces single worker
mace: typically runs efficiently in single-process mode
Ignored when device = "cuda" (GPU mode uses a single worker).
chunksize (integer)
Chunk size used in the multiprocessing pool.
Relevant only when device = "cpu".
gpu_id (integer, default: 0)
GPU index used when device = "cuda".
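The fragment below sketches the two execution variants for the scan. The worker count and chunk size are illustrative; these keys belong to the same [scan] table as the backend settings.

```toml
[scan]
backend = "m3gnet"
model = "default"
task = ""

# CPU run: parallelism comes from the multiprocessing pool
device = "cpu"
n_workers = 8       # illustrative
chunksize = 16      # illustrative

# GPU alternative: n_workers and chunksize are then ignored
# device = "cuda"
# gpu_id = 0
```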
anion_species (array of strings)
Species excluded from substitutional enumeration. Typically contains oxygen:
anion_species = ["O"]
host_species (from [doping])
Used to define the cation sublattice.
Must match the host element defined in [doping].
skip_if_done (boolean, default: true)
If true, skip a structure folder if ranking_scan.csv already exists.
Sampling parameters
Used when:
mode = "sample", or
mode = "auto" selects sampling for large configuration spaces
sample_budget (integer)
Maximum number of random sampling attempts.
sample_batch_size (integer)
Number of new symmetry-unique sampled configurations evaluated per batch.
sample_patience (integer)
Sampling stops after this many sampled configurations fail to improve the current best candidate.
sample_seed (integer)
Random seed used for reproducible sampling.
sample_max_saved (integer)
Maximum number of sampled canonical configurations stored to avoid duplicates.
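A sampling-mode configuration could be sketched as follows. Every number is an illustrative assumption and should be tuned to the size of the configuration space.

```toml
[scan]
mode = "sample"
topk = 10                   # illustrative
sample_budget = 20000       # upper bound on random sampling attempts
sample_batch_size = 64      # new unique configurations evaluated per batch
sample_patience = 500       # stop after this many samples without improvement
sample_seed = 1234          # fixed seed for reproducible sampling
sample_max_saved = 50000    # cap on stored canonical configurations
```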
Notes
If device = "cuda", scan runs on a single GPU and n_workers is ignored.
If device = "cpu", parallelization is controlled via n_workers and chunksize.
GPU mode is recommended for faster single-structure evaluation, while CPU mode scales better across many configurations.
The scan backend can be selected via backend.
Different backends have different accuracy/speed trade-offs:
m3gnet: stable, general-purpose
uma: high-quality FAIR-Chem models (requires authentication)
mace: fast and scalable foundation models
grace: advanced GNN-based models (experimental)
For large configuration spaces, sampling mode is recommended.
GPU acceleration is backend-dependent and may not always scale with multiprocessing.
Output
For each composition folder:
ranking_scan.csv
scan_summary.txt
candidate_001/01_scan/POSCAR
candidate_001/01_scan/meta.json
...
Each candidate folder contains:
symmetry-unique configuration
single-point energy (from selected backend)
dopant site signature
scan metadata
The following metadata is recorded:
backend
model
task
energy_sp_eV
configuration details
[relax]
Step 03 — Structural relaxation using machine-learning interatomic potentials.
This stage relaxes the low-energy candidates selected in Step 02.
For each structure folder in [structure].outdir, the workflow reads:
candidate_*/01_scan/POSCAR
and writes relaxed structures and metadata to:
candidate_*/02_relax/
Relaxation is performed using the selected ML backend together with an ASE optimizer.
Supported backends:
"m3gnet"
"uma"
"mace"
"grace"
Supported optimizers:
"bfgs"
"lbfgs"
"fire"
"mdmin"
"quasinewton"
Relaxation stops when either:
the maximum atomic force drops below fmax
the number of optimizer steps reaches max_steps
Parameters
backend (string, default: "m3gnet")
Selects the ML backend used for structural relaxation.
Available options:
"m3gnet": TensorFlow-based universal interatomic potential
"uma": FAIR-Chem universal model (requires Hugging Face access)
"mace": MACE foundation models
"grace": GRACE graph neural network models (if installed)
model (string, default: "default")
Specifies the pretrained model variant used by the selected relaxation backend.
Behavior depends on backend:
m3gnet:
- "default" → loads the standard pretrained M3GNet model
uma:
- "uma-s-1p2"
- "uma-s-1p1"
- "uma-m-1p1"
mace:
- "small"
- "medium"
- "large"
- "small-mpa-0"
- "medium-mpa-0"
- "large-mpa-0"
- "small-omat-0"
- "medium-omat-0"
grace:
- GRACE-1L-OMAT
- GRACE-1L-OMAT-M-base
- GRACE-1L-OMAT-M
- GRACE-1L-OMAT-L-base
- GRACE-1L-OMAT-L
- GRACE-2L-OMAT
- GRACE-2L-OMAT-M-base
- GRACE-2L-OMAT-M
- GRACE-2L-OMAT-L-base
- GRACE-2L-OMAT-L
- GRACE-1L-OAM
- GRACE-1L-OAM-M
- GRACE-1L-OAM-L
- GRACE-2L-OAM
- GRACE-2L-OAM-M
- GRACE-2L-OAM-L
- GRACE-1L-SMAX-L
- GRACE-1L-SMAX-OMAT-L
- GRACE-2L-SMAX-M
- GRACE-2L-SMAX-L
- GRACE-2L-SMAX-OMAT-M
- GRACE-2L-SMAX-OMAT-L
task (string, default: "")
Optional task specification used only for uma.
Allowed values for uma:
"omat"
"oc20"
"oc22"
"oc25"
"omol"
"odac"
"omc"
For m3gnet, mace, and grace, this parameter is ignored and should be left empty.
optimizer (string, default: "bfgs")
ASE optimizer used during relaxation.
Available options:
"bfgs"
"lbfgs"
"fire"
"mdmin"
"quasinewton"
fmax (float)
Maximum force convergence criterion (eV/Å).
Relaxation is considered converged when the maximum atomic force falls below this threshold.
max_steps (integer, default: 300)
Maximum number of optimizer steps.
If this limit is reached before the force threshold is satisfied, the relaxation stops and the final structure is still written to disk.
device (string, default: "cpu")
Execution device:
"cpu"
"cuda"
Behavior depends on backend and runtime environment. GPU execution uses a single effective worker.
gpu_id (integer, default: 0)
GPU index used when device = "cuda".
n_workers (integer)
Number of parallel relaxation workers (one candidate per worker process).
Relevant mainly when device = "cpu".
tf_threads (integer)
TensorFlow thread count per worker.
Mainly relevant for the m3gnet backend.
Keep small (typically 1) when using multiple workers.
omp_threads (integer)
OpenMP thread count per worker.
Keep small to avoid CPU oversubscription.
skip_if_done (boolean)
Skip an entire composition folder if ranking_relax.csv already exists.
skip_candidate_if_done (boolean)
Skip an individual candidate if both of the following already exist:
candidate_*/02_relax/meta.json
candidate_*/02_relax/POSCAR
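A CPU relaxation setup might be sketched as below. The thread and worker counts are illustrative; as a rule of thumb (an assumption, not a documented requirement), n_workers multiplied by the per-worker thread count should stay within the number of available cores.

```toml
[relax]
backend = "m3gnet"
model = "default"
task = ""
optimizer = "fire"
fmax = 0.05                     # eV/Å, illustrative
max_steps = 300
device = "cpu"
n_workers = 4                   # one candidate per worker process
tf_threads = 1                  # keep small with multiple workers
omp_threads = 1                 # keep small to avoid oversubscription
skip_if_done = true
skip_candidate_if_done = true
```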
Notes
If device = "cuda", relaxation uses a single effective worker and n_workers is ignored.
If device = "cpu", parallelization is controlled via n_workers.
The relaxed POSCAR follows the species ordering defined by [generate].poscar_order.
If poscar_order is empty, the default pymatgen ordering is used.
Output
For each candidate:
candidate_###/02_relax/POSCAR
candidate_###/02_relax/meta.json
For each structure folder:
ranking_relax.csv
The metadata records:
backend
model
task
optimizer
relaxed energy
convergence target
final maximum force
optimizer step count
convergence status
walltime
link to the original scan metadata
[filter]
Step 04 — Candidate selection.
mode (string)
"window"
"topn"
window_meV (float)
Energy window above the lowest relaxed energy (in meV).
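Candidates with:
E_relaxed <= E_min + window_meV
are retained, where E_min is the lowest relaxed energy and the window is converted to the same energy unit as E_relaxed.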
A value of 0 keeps only the lowest-energy structure.
If no candidate satisfies the filtering criteria, the workflow raises an error.
max_candidates (integer)
Number of candidates kept when mode = "topn".
skip_if_done (boolean)
Skip filtering if output exists.
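A window-based filter could be configured as in the sketch below; the window width and candidate count are illustrative.

```toml
[filter]
mode = "window"
window_meV = 50.0        # illustrative window width
max_candidates = 5       # applies when mode = "topn"
skip_if_done = true
```

Assuming relaxed energies are stored in eV, a 50 meV window would keep a candidate lying 0.030 eV above the minimum and drop one lying 0.080 eV above it. Whether the comparison uses total or per-atom energies is not specified here.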
[bandgap]
Step 05 — Bandgap prediction using a local ALIGNN model.
Requires environment variable ALIGNN_MODEL_DIR pointing to
a local ALIGNN model directory.
skip_if_done (boolean)
If true, previously computed bandgap results are reused.
Behavior:
- If candidate_*/03_band/meta.json already exists, the stored bandgap value is reused and prediction is skipped for that candidate.
- The summary CSV is rebuilt from existing metadata.
- This allows safe re-running of the workflow without recomputing already processed candidates.
If a candidate prediction fails:
- The error is recorded in candidate_*/03_band/meta.json.
- The workflow continues with remaining candidates.
- Failed candidates appear in the summary CSV with NaN bandgap.
cutoff (float)
Neighbor cutoff radius (Å) used to construct the atomic graph for ALIGNN inference. Must be > 0.
max_neighbors (integer)
Maximum number of neighbors retained per atom when building the graph. Must be > 0.
device (string, default: "cpu")
Execution device:
"cpu"
"cuda"
gpu_id (integer, default: 0)
GPU index used when device = "cuda".
batch_size (integer, default: 32)
Batch size used for GPU inference.
n_workers (integer)
Number of CPU workers used for parallel bandgap prediction.
Notes
CPU mode → multiprocessing over structures using n_workers
GPU mode → batched inference using batch_size
n_workers is ignored when using GPU
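The fragment below sketches a GPU configuration for this stage. The cutoff and neighbor values are illustrative, and ALIGNN_MODEL_DIR must still be set in the environment before the stage runs.

```toml
[bandgap]
skip_if_done = true
cutoff = 8.0           # Å, illustrative; must be > 0
max_neighbors = 12     # illustrative; must be > 0
device = "cuda"
gpu_id = 0
batch_size = 32
# n_workers = 4        # only used when device = "cpu"
```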
[formation]
Step 06 — Formation energy calculation.
Formation energies are computed using the chemical potentials written by
refs-build in reference_structures/reference_energies.json.
The reference scheme (metal or oxide) is automatically determined from
[references].reference_mode and no additional user input is required here.
skip_if_done (boolean)
If true, skip formation calculation if formation_energies.csv already exists
in a composition folder.
normalize (string)
Defines how the reported formation energy is normalized:
"total": total formation energy (eV)
"per_dopant": eV per substituted dopant atom
"per_host": eV per atom in the supercell
Formation energies require:
Successful execution of refs-build
reference_structures/reference_energies.json
Relaxed candidate structures in candidate_*/02_relax/
Notes
The same substitution formula is used for both reference schemes.
In metal mode, elemental metal references define the chemical potentials.
In oxide mode, chemical potentials are derived from oxide references and the chosen oxygen condition (O-rich or O-poor).
The selected reference mode is stored in candidate_*/04_formation/meta.json for reproducibility.
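Since the reference scheme is taken from [references], the table for this stage stays short. A minimal sketch, with an arbitrary choice of normalization:

```toml
[formation]
skip_if_done = true
normalize = "per_dopant"    # alternatives: "total", "per_host"
```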
[database]
Step 07 — Final database collection.
skip_if_done (boolean, default: true)
If true, do not overwrite existing results_database.csv.
[surface]
Step 08 — Surface generation and optional surface relaxation.
This stage constructs slab models from selected relaxed bulk candidates and optionally relaxes those slabs using machine-learning interatomic potentials.
The stage reads candidate data from a database file:
results_database.csv
For each selected candidate, the workflow loads the relaxed bulk structure from:
candidate_*/02_relax/POSCAR
and generates surface slabs for the requested orientations and terminations.
Generated slabs are written to:
<[surface].outdir>/<composition_tag>/candidate_###/hkl_*/term_*/
If enabled, slab relaxation is performed using the same backend abstraction as in Step 03.
Candidate Selection
Selection is performed in two stages:
1. Restrict the dataset by composition, using composition_tag or composition_tags.
2. Apply selection_mode inside the selected composition subset.
Available selection modes:
"id"
"ids"
"rank_range"
"top_n"
"filters"
Orientation Modes
orientation_mode (string)
"explicit": use Miller indices from miller_list
"automatic": generate indices up to max_miller
miller_list (list)
List of Miller indices when using explicit mode.
Example:
[[1, 0, 0], [1, 1, 0], [1, 1, 1]]
max_miller (integer)
Maximum Miller index used in automatic mode.
max_orientations (integer)
Maximum number of orientations kept in automatic mode.
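The two orientation modes can be sketched as follows; the automatic-mode limits are illustrative values.

```toml
[surface]
orientation_mode = "explicit"
miller_list = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]

# Automatic alternative:
# orientation_mode = "automatic"
# max_miller = 2            # illustrative
# max_orientations = 5      # illustrative
```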
Slab Construction
min_slab_size (float)
Minimum slab thickness.
min_vacuum_size (float)
Minimum vacuum thickness.
center_slab (boolean)
If true, centers the slab in the simulation cell.
in_unit_planes (boolean)
Controls units of slab and vacuum thickness:
true → values interpreted in crystallographic planes
false → values interpreted in Å (recommended)
lll_reduce (boolean)
Apply LLL lattice reduction.
primitive (boolean)
Build primitive slab if possible.
reorient_lattice (boolean)
Align slab normal with the out-of-plane axis.
orthogonal_c (boolean)
If true, enforces:
c perpendicular to (a, b)
This is strongly recommended for:
clean POSCAR output
stable relaxation
correct slab thickness interpretation
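A slab-construction fragment following these recommendations might look like the sketch below. The thickness values and the boolean choices other than in_unit_planes and orthogonal_c are illustrative assumptions.

```toml
[surface]
min_slab_size = 12.0      # Å, because in_unit_planes = false; illustrative
min_vacuum_size = 15.0    # Å; illustrative
center_slab = true
in_unit_planes = false    # interpret thicknesses in Å
lll_reduce = false
primitive = true
reorient_lattice = true
orthogonal_c = true       # c perpendicular to (a, b)
```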
Surface Terminations
termination_mode (string)
"all": keep all generated terminations
"first": keep only the first termination
max_terminations_per_orientation (integer)
Maximum number of terminations retained per orientation.
Fixed Atoms
The surface stage can optionally fix atoms in the slab.
Fixed atoms are written in the same POSCAR file using selective dynamics:
fixed → F F F
free → T T T
fix_atoms (boolean)
Enable or disable fixed atoms.
fix_region (string)
"bottom": fix bottom part of slab
"middle": fix central region
fix_method (string)
"layers": fix a number of atomic layers
"thickness": fix a region of given thickness (Å)
fix_n_layers (integer)
Number of layers to fix when using layers mode.
fix_thickness_A (float)
Thickness of fixed region (Å) when using thickness mode.
fix_layer_tolerance_A (float)
Tolerance used to group atoms into layers along the z-direction.
Notes
Layer grouping is based on Cartesian z-coordinates.
Fixed atoms are enforced during relaxation using ASE constraints, not only written to the POSCAR.
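Termination and fixed-atom settings could be combined as in this sketch, which fixes the bottom two layers of each slab. All numeric values are illustrative.

```toml
[surface]
termination_mode = "all"
max_terminations_per_orientation = 3   # illustrative

fix_atoms = true
fix_region = "bottom"
fix_method = "layers"
fix_n_layers = 2                       # illustrative
fix_layer_tolerance_A = 0.3            # Å, illustrative
# fix_thickness_A = 4.0                # used instead when fix_method = "thickness"
```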
Surface Relaxation
relax_surface (boolean)
Enable or disable slab relaxation.
Relaxation is performed using the same backend system as in Step 03.
surface_backend (string, default: "m3gnet")
Available options:
"m3gnet"
"uma"
"mace"
"grace"
surface_model (string, default: "default")
Model selection depends on backend (same behavior as Step 03).
surface_task (string, default: "")
Optional task specification (used only for uma).
surface_optimizer (string, default: "bfgs")
ASE optimizer:
"bfgs"
"lbfgs"
"fire"
"mdmin"
"quasinewton"
surface_fmax (float)
Maximum force convergence criterion (eV/Å).
surface_max_steps (integer, default: 300)
Maximum number of optimizer steps.
surface_device (string, default: "cpu")
"cpu"
"cuda"
surface_gpu_id (integer, default: 0)
GPU index used when surface_device = "cuda".
surface_tf_threads (integer)
TensorFlow thread count (for m3gnet).
surface_omp_threads (integer)
OpenMP thread count.
Notes
GPU mode uses a single effective worker.
CPU mode runs sequentially per slab (no multiprocessing currently).
Fixed atoms are enforced via ASE FixAtoms.
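A slab-relaxation fragment using the documented defaults where they exist might look like the sketch below; the force threshold and thread counts are illustrative.

```toml
[surface]
relax_surface = true
surface_backend = "m3gnet"
surface_model = "default"
surface_task = ""
surface_optimizer = "bfgs"
surface_fmax = 0.05          # eV/Å, illustrative
surface_max_steps = 300
surface_device = "cpu"
surface_tf_threads = 1       # illustrative
surface_omp_threads = 4      # illustrative; slabs run sequentially on CPU
```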
Surface energy
When enabled, the surface stage also evaluates the surface energy of each generated slab.
The surface energy is computed using the standard symmetric slab expression:
\[ \gamma = \frac{E_{\mathrm{slab}} - n\,E_{\mathrm{bulk}}}{2A} \]
where:
\(E_{\mathrm{slab}}\) is the total energy of the slab (typically from surface relaxation)
\(E_{\mathrm{bulk}}\) is the relaxed bulk energy of the parent candidate
\(n\) is the number of bulk-equivalent units contained in the slab
\(A\) is the surface area
the factor of 2 accounts for the two surfaces of the slab
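As a worked illustration of these quantities, the block below evaluates the symmetric slab expression for made-up numbers. The input values are assumptions chosen for easy arithmetic, and the factor 16.0218 is the rounded conversion from eV/Å² to J/m².

```latex
% Assumed inputs: E_slab = -250.0 eV, n = 4, E_bulk = -63.0 eV, A = 20.0 Å^2
\gamma = \frac{E_{\mathrm{slab}} - n\,E_{\mathrm{bulk}}}{2A}
       = \frac{-250.0 - 4 \times (-63.0)}{2 \times 20.0}\ \mathrm{eV/\AA^2}
       = \frac{2.0}{40.0}\ \mathrm{eV/\AA^2}
       = 0.050\ \mathrm{eV/\AA^2}

% Unit conversion: 1 eV/Å^2 = 16.0218 J/m^2 (rounded)
\gamma \approx 0.050 \times 16.0218\ \mathrm{J/m^2} \approx 0.80\ \mathrm{J/m^2}
```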
Requirements
Surface energy is computed only when all of the following conditions are satisfied:
A slab energy is available (typically when relax_surface = true)
A bulk reference energy is available from the workflow database
The slab composition is proportional to the parent bulk composition
If any of these conditions are not met, the surface energy is not computed and a status flag is recorded instead.
Output fields
The following fields are written to:
meta.json (per surface)
surface_summary.csv (global summary)
surface_energy_status (string)
Indicates whether surface energy was successfully computed.
Possible values include:
"ok": surface energy successfully computed
"missing_slab_energy": slab energy not available
"missing_bulk_energy": bulk reference energy missing
"not_computable_not_proportional": slab composition not proportional to bulk
"not_computable_missing_species"
"not_computable_extra_species"
"failed: <error>": unexpected numerical or runtime failure
surface_energy_eV_A2 (float)
Surface energy expressed in eV/Ų.
surface_energy_J_m2 (float)
Surface energy expressed in J/m².
Conversion used:
\(1\ \mathrm{eV/\AA^2} \approx 16.0218\ \mathrm{J/m^2}\)
surface_energy_reference_bulk_eV (float)
Bulk reference energy used in the calculation.
surface_energy_n_bulk_equiv (float)
Number of bulk-equivalent units contained in the slab.
This value is determined automatically from the ratio of slab and bulk compositions.
Notes
Surface energies are only well-defined for slabs whose composition is proportional to the parent bulk.
Non-stoichiometric terminations are not assigned a surface energy in the current implementation.
Future extensions may include chemical-potential-based surface energies for non-stoichiometric slabs.
Output
For each generated slab:
composition_tag/
    candidate_###/
        hkl_h_k_l/
            term_###/
                POSCAR
                CONTCAR
                meta.json
                surface_relax.log
                surface_relax.traj
                surface_relax.json
Files:
POSCAR: Generated slab structure (with optional selective dynamics)
CONTCAR: Relaxed slab structure (if relaxation is enabled)
meta.json: Slab metadata
surface_relax.log: Relaxation log
surface_relax.traj: ASE trajectory
surface_relax.json: Relaxation metadata
Global output:
surface_summary.csv
Notes
min_slab_size and min_vacuum_size should be interpreted in Å when in_unit_planes = false.
orthogonal_c = true is recommended for correct slab geometry.
Relaxed structures are saved separately and do not overwrite the initial slab.
Design Principles
All stages are deterministic given fixed input.
All randomness is seed-controlled.
Each stage can be skipped independently.
Each stage writes metadata for full reproducibility.
The relaxed host supercell is constructed once and reused.