Input File Specification (input.toml)

The workflow is fully controlled through a single TOML configuration file.

Each stage reads only the parameters relevant to it.

Execution Model

Each stage defines both:

  • what it does (physics / workflow logic)

  • how it runs (CPU / GPU / parallelization)

There is no global hardware section. Instead:

  • [scan] controls screening

  • [relax] controls structural relaxation

  • [bandgap] controls band gap prediction

Each stage independently defines:

  • device (cpu or cuda)

  • gpu_id (for GPU execution)

This design gives full flexibility and avoids cross-stage conflicts.
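
As a rough sketch of per-stage hardware settings, the fragment below gives each stage its own device and, where relevant, its own gpu_id. The worker count, GPU index, and batch size are illustrative assumptions, not recommended settings.

```toml
# Illustrative values only: each stage carries its own hardware keys.
[scan]
device = "cpu"      # CPU multiprocessing
n_workers = 8       # assumed worker count

[relax]
device = "cuda"     # single-GPU relaxation
gpu_id = 0

[bandgap]
device = "cuda"     # batched GPU inference
gpu_id = 0
batch_size = 32
```

Because nothing is shared between the tables, changing the device of one stage leaves the others untouched.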

Execution Behavior

Stage      Device       Strategy
scan       CPU          multiprocessing
scan       GPU          single worker on one GPU
relax      CPU / GPU    workers (CPU) or single GPU
bandgap    CPU          multiprocessing
bandgap    GPU          batched inference

This design allows each stage to control its own parallelization strategy while exposing the same device and gpu_id keys in every stage.

All paths are interpreted relative to the directory containing input.toml.

The following sections are supported:

  • [references]

  • [structure]

  • [doping]

  • [generate]

  • [scan]

  • [relax]

  • [filter]

  • [bandgap]

  • [formation]

  • [database] (optional)

Not all sections are required for every stage. Each stage reads only what it needs.


[references]

Step 00 — Reference construction and relaxation.

This stage prepares all thermodynamic reference structures and writes:

reference_structures/reference_energies.json

It performs:

  • Relaxation of the host oxide unit cell

  • Construction and relaxation of the host supercell

  • Relaxation of metal reference phases (metal mode)

  • Relaxation of oxide reference phases (oxide mode)

  • Relaxation of O₂ gas (oxide mode)

  • Storage of all relaxed POSCAR files for reuse

Common Parameters

reference_mode (string)

Choose reference scheme:

  • "metal"

  • "oxide"

skip_if_done (boolean)

Skip reference construction if the JSON cache (reference_structures/reference_energies.json) already exists.

fmax (float)

Force convergence criterion used in relaxation (eV/Å).

supercell (array of 3 integers)

Supercell used later for doping. The host supercell is constructed and relaxed at this stage.

host (string)

Chemical formula of the host oxide (e.g. "SnO2").

host_dir (string)

Directory containing <host>.POSCAR.

Metal Reference Mode

Used when reference_mode = "metal".

metal_ref (array of strings)

List of metal element symbols used as reference phases.

metals_dir (string)

Directory containing <Element>.POSCAR files.

Example:

reference_structures/metals/Sn.POSCAR
reference_structures/metals/Sb.POSCAR
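
A minimal sketch of a [references] table in metal mode, built from the parameters described above. The fmax value, the supercell size, and the host_dir path are assumptions chosen for illustration; metals_dir follows the example paths shown above.

```toml
[references]
reference_mode = "metal"
skip_if_done = true
fmax = 0.05                                 # eV/Å, assumed value
supercell = [2, 2, 3]                       # assumed supercell size
host = "SnO2"
host_dir = "reference_structures/host"      # hypothetical path; must contain SnO2.POSCAR
metal_ref = ["Sn", "Sb"]
metals_dir = "reference_structures/metals"  # contains Sn.POSCAR and Sb.POSCAR
```

All directories are resolved relative to the folder that holds input.toml.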

Oxide Reference Mode

Used when reference_mode = "oxide".

oxides_ref (array of strings)

List of dopant oxide formulas (e.g. "Sb2O5").

oxides_dir (string)

Directory containing oxide POSCAR files.

gas_ref (string)

Gas reference formula (typically "O2").

gas_dir (string)

Directory containing gas POSCAR file.

oxygen_mode (string)

Currently supports:

  • "O-rich"

  • "O-poor"

muO_shift_ev (float)

Optional chemical potential shift applied to oxygen (eV).
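
The oxide-mode parameters could be combined as in the sketch below. Directory names, fmax, and the supercell size are hypothetical; only the key names and the allowed oxygen_mode values come from the descriptions above.

```toml
[references]
reference_mode = "oxide"
skip_if_done = true
fmax = 0.05                                 # eV/Å, assumed value
supercell = [2, 2, 3]                       # assumed supercell size
host = "SnO2"
host_dir = "reference_structures/host"      # hypothetical path
oxides_ref = ["Sb2O5"]
oxides_dir = "reference_structures/oxides"  # hypothetical path
gas_ref = "O2"
gas_dir = "reference_structures/gas"        # hypothetical path
oxygen_mode = "O-rich"                      # or "O-poor"
muO_shift_ev = 0.0                          # optional shift of the oxygen chemical potential (eV)
```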


[structure]

Defines workflow I/O only.

outdir (string, default: "random_structures")

Directory where generated composition folders are written.

This directory becomes the root for:

  • Step 01 outputs

  • Steps 02–06 per-composition subfolders


[doping]

Defines substitutional doping behavior.

host_species (string, required)

Element symbol of the host species to be substituted.

mode (string)

Defines doping mode:

  • "explicit" — user provides exact compositions.

  • "enumerate" — workflow constructs compositions combinatorially.

Explicit Mode

compositions (array of tables)

Array of inline tables, each mapping dopant element symbols to percentages:

compositions = [
    { Sb = 5 },
    { Sb = 5, Zr = 5 }
]

Percentages are defined relative to the host sites. For example, { Sb = 5 } substitutes Sb on 5% of the host_species sites.

Enumerate Mode

dopants (array of strings)

Allowed dopant elements.

must_include (array of strings)

Dopants that must appear in each composition.

max_dopants_total (integer)

Maximum number of distinct dopants per structure.

allowed_totals (array of floats)

Allowed total dopant percentages.

levels (array of floats)

Discrete concentration values per dopant.
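
An enumerate-mode [doping] table might look like the following sketch. The dopant list and all numeric values are assumptions for illustration, and the comments only restate the parameter descriptions above.

```toml
[doping]
host_species = "Sn"             # host element to be substituted (assumed for a SnO2 host)
mode = "enumerate"
dopants = ["Sb", "Zr", "Ti"]    # allowed dopant elements
must_include = ["Sb"]           # every composition must contain Sb
max_dopants_total = 2           # at most two distinct dopants per structure
allowed_totals = [5.0, 10.0]    # allowed total dopant percentages
levels = [2.5, 5.0, 7.5]        # allowed per-dopant percentages
```

If the constraints combine as their descriptions suggest, a composition such as { Sb = 5, Zr = 5 } would qualify here: it contains Sb, uses two dopants, and totals 10%.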


[generate]

Step 01 — Structure generation.

This step generates one doped structure per composition by substituting host atoms inside the relaxed host supercell produced by refs-build.

Important

  • refs-build (Step 00, configured in [references]) must be executed first.

  • The relaxed host supercell is loaded from:

    reference_structures/reference_energies.json

  • No supercell is constructed in this step.

Parameters

seed_base (integer)

Base seed used for deterministic random substitution.

Each composition generates a stable hash-based seed.

poscar_order (array of strings, optional)

Defines element ordering in written POSCAR files.

If empty, pymatgen default ordering is used.

Example:

poscar_order = ["Zr", "Ti", "Sb", "Sn", "O"]

clean_outdir (boolean, default: true)

If true, existing output directory is deleted before writing new structures.

Output

For each composition:

<outdir>/<composition_tag>/
    POSCAR
    metadata.json
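
For orientation, the [structure] and [generate] tables could be written as in this sketch. The seed_base value is an arbitrary assumption; the other values repeat the defaults and the example given above.

```toml
[structure]
outdir = "random_structures"    # root folder for all per-composition output

[generate]
seed_base = 42                                  # assumed value
poscar_order = ["Zr", "Ti", "Sb", "Sn", "O"]    # element order in written POSCAR files
clean_outdir = true                             # deletes the existing output directory first
```

Setting clean_outdir = false is the safer choice when the output directory already holds results from later steps.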

[scan]

Step 02 — Dopant configuration prescreening using M3GNet.

For each generated structure folder inside [structure].outdir, this step:

  1. Generates doped configurations on the cation sublattice

  2. Identifies symmetry-unique configurations

  3. Evaluates single-point energies using M3GNet

  4. Ranks configurations by energy

  5. Keeps the lowest-energy topk candidates

  6. Writes candidate folders and ranking files

Depending on the scan mode, configurations are obtained either by:

  • exact symmetry-unique enumeration

  • random symmetry-unique sampling

This step operates only on subfolders created in Step 01.

Parameters

poscar_in (string, default: "POSCAR")

Filename inside each composition folder used as input.

topk (integer)

Number of lowest-energy configurations retained.

symprec (float)

Tolerance used for symmetry detection in SpacegroupAnalyzer.

mode (string, default: "auto")

Scan strategy.

Possible values:

  • "auto": automatically chooses exact enumeration for manageable problems and switches to sampling when the configuration space becomes too large.

  • "exact": forces full symmetry-unique enumeration of all configurations.

  • "sample": uses random symmetry-unique sampling instead of full enumeration.

max_enum (integer)

Maximum allowed number of raw combinatorial configurations in exact mode.

If this limit is exceeded and mode = "auto", the scan automatically switches to sampling mode.

max_unique (integer)

Maximum allowed number of symmetry-unique configurations in exact mode.

Prevents excessive memory usage for very large configuration spaces.

device (string, default: "cpu")

Execution device:

  • "cpu"

  • "cuda"

Controls how M3GNet energies are evaluated.

n_workers (integer)

Number of parallel worker processes.

  • Used only when device = "cpu"

  • Ignored when device = "cuda" (GPU mode uses a single worker)

chunksize (integer)

Chunk size used in the multiprocessing pool.

Relevant only when device = "cpu".

gpu_id (integer, default: 0)

GPU index used when device = "cuda".

anion_species (array of strings)

Species excluded from substitutional enumeration. Typically contains oxygen:

anion_species = ["O"]

host_species (from [doping])

Used to define the cation sublattice. Must match the host element defined in [doping].

skip_if_done (boolean, default: true)

If true, skip a structure folder if ranking_scan.csv already exists.
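
A CPU-based [scan] table using the parameters above might look like this sketch. Every numeric value is an illustrative assumption rather than a documented default.

```toml
[scan]
poscar_in = "POSCAR"
topk = 10               # assumed: keep the 10 lowest-energy configurations
symprec = 0.01          # assumed symmetry tolerance
mode = "auto"           # exact enumeration, falling back to sampling
max_enum = 1000000      # assumed cap on raw combinatorial configurations
max_unique = 50000      # assumed cap on symmetry-unique configurations
device = "cpu"
n_workers = 8           # assumed worker count; ignored when device = "cuda"
chunksize = 16          # assumed multiprocessing chunk size
anion_species = ["O"]
skip_if_done = true
```

host_species is not repeated here because it is read from [doping].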

Sampling parameters

Used when:

  • mode = "sample"

  • or mode = "auto" selects sampling for large configuration spaces

Maximum number of random sampling attempts.

sample_batch_size (integer)

Number of new symmetry-unique sampled configurations evaluated per batch.

sample_patience (integer)

Sampling stops after this many sampled configurations fail to improve the current best candidate.

sample_seed (integer)

Random seed used for reproducible sampling.

sample_max_saved (integer)

Maximum number of sampled canonical configurations stored to avoid duplicates.
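
The named sampling keys could be added to the same [scan] table as sketched below. All values are assumptions. The key for the maximum number of sampling attempts is left out because its name is not given above.

```toml
[scan]
mode = "sample"
sample_batch_size = 64      # assumed: new unique configurations evaluated per batch
sample_patience = 500       # assumed: stop after this many configurations without improvement
sample_seed = 1234          # assumed seed for reproducible sampling
sample_max_saved = 100000   # assumed cap on stored canonical configurations
```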

Notes

  • If device = "cuda", scan runs on a single GPU and n_workers is ignored.

  • If device = "cpu", parallelization is controlled via n_workers and chunksize.

  • GPU mode is recommended for faster single-structure evaluation, while CPU mode scales better across many configurations.

Output

For each composition folder:

ranking_scan.csv
scan_summary.txt
candidate_001/01_scan/POSCAR
candidate_001/01_scan/meta.json
...

Each candidate folder contains:

  • symmetry-unique configuration

  • single-point M3GNet energy

  • dopant site signature

  • scan metadata

[relax]

Step 03 — Structural relaxation.

Relaxes the symmetry-selected candidates using the pretrained M3GNet Relaxer. For each structure folder in [structure].outdir, the candidates from candidate_*/01_scan/POSCAR are relaxed in parallel.

fmax (float)

Maximum force convergence criterion (eV/Å). Relaxation stops when the maximum atomic force falls below this threshold.

device (string, default: "cpu")

Execution device:

  • "cpu"

  • "cuda"

gpu_id (integer, default: 0)

GPU index used when device = "cuda".

n_workers (integer)

Number of parallel relaxation workers (one candidate per worker process).

tf_threads (integer)

TensorFlow thread count per worker. Keep small (typically 1) when using multiple workers.

omp_threads (integer)

OpenMP thread count per worker. Keep small to avoid CPU oversubscription.

skip_if_done (boolean)

Skip an entire composition folder if ranking_relax.csv already exists.

skip_candidate_if_done (boolean)

Skip an individual candidate if candidate_*/02_relax/meta.json and POSCAR already exist.

Notes

  • If device = "cuda", relaxation runs on a single GPU and n_workers is ignored.

  • If device = "cpu", parallelization is controlled via n_workers.

  • Species ordering in the relaxed POSCAR follows [generate].poscar_order.

  • If poscar_order is empty, the default pymatgen ordering is used.
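
A CPU relaxation setup following the guidance above could be sketched as follows. The fmax value and the worker count are assumptions; the thread counts follow the advice to keep them small when several workers run at once.

```toml
[relax]
fmax = 0.05                     # eV/Å, assumed value
device = "cpu"
n_workers = 4                   # assumed: one candidate per worker process
tf_threads = 1                  # TensorFlow threads per worker
omp_threads = 1                 # OpenMP threads per worker
skip_if_done = true             # skip folders that already contain ranking_relax.csv
skip_candidate_if_done = true   # skip candidates with existing 02_relax output
```

With these settings roughly n_workers × omp_threads cores are in use, so the product should stay at or below the number of available cores.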


[filter]

Step 04 — Candidate selection.

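mode (string)

  • "window": keep all candidates within window_meV of the lowest relaxed energy

  • "topn": keep the max_candidates lowest-energy candidates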

window_meV (float)

Energy window above the lowest relaxed energy (in meV).

Candidates with:

E_relaxed <= E_min + window_meV   (both energies expressed in meV)

are retained.

A value of 0 keeps only the lowest-energy structure.

If no candidate satisfies the filtering criteria, the workflow raises an error.

max_candidates (integer)

Number of candidates kept when mode = "topn".

skip_if_done (boolean)

Skip filtering if output exists.
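
The two filter modes could be configured as in this sketch. The window width and the candidate count are illustrative assumptions.

```toml
[filter]
mode = "window"
window_meV = 50.0       # assumed: keep candidates within 50 meV of the lowest energy
skip_if_done = true

# Alternative: keep a fixed number of candidates instead of an energy window.
# mode = "topn"
# max_candidates = 5    # assumed value
```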


[bandgap]

Step 05 — Bandgap prediction using a local ALIGNN model.

Requires environment variable ALIGNN_MODEL_DIR pointing to a local ALIGNN model directory.

skip_if_done (boolean)

If true, previously computed bandgap results are reused.

Behavior:

  • If candidate_*/03_band/meta.json already exists, the stored bandgap value is reused and prediction is skipped for that candidate.

  • The summary CSV is rebuilt from existing metadata.

  • This allows safe re-running of the workflow without recomputing already processed candidates.

If a candidate prediction fails:

  • The error is recorded in candidate_*/03_band/meta.json.

  • The workflow continues with remaining candidates.

  • Failed candidates appear in the summary CSV with NaN bandgap.

cutoff (float)

Neighbor cutoff radius (Å) used to construct the atomic graph for ALIGNN inference. Must be > 0.

max_neighbors (integer)

Maximum number of neighbors retained per atom when building the graph. Must be > 0.

device (string, default: "cpu")

Execution device:

  • "cpu"

  • "cuda"

gpu_id (integer, default: 0)

GPU index used when device = "cuda".

batch_size (integer, default: 32)

Batch size used for GPU inference.

n_workers (integer)

Number of CPU workers used for parallel bandgap prediction.

Notes

  • CPU mode → multiprocessing over structures using n_workers

  • GPU mode → batched inference using batch_size

  • n_workers is ignored when using GPU
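
A GPU-based [bandgap] table might look like the sketch below. The cutoff and max_neighbors values are assumptions chosen for illustration; batch_size repeats the documented default.

```toml
# ALIGNN_MODEL_DIR must be set in the environment; it is not a TOML key.
[bandgap]
skip_if_done = true
cutoff = 8.0            # Å, assumed value; must be > 0
max_neighbors = 12      # assumed value; must be > 0
device = "cuda"
gpu_id = 0
batch_size = 32         # used for GPU inference
# n_workers is omitted because it is ignored on GPU
```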


[formation]

Step 06 — Formation energy calculation.

Formation energies are computed using the chemical potentials written by refs-build in reference_structures/reference_energies.json.

The reference scheme (metal or oxide) is automatically determined from [references].reference_mode and no additional user input is required here.

skip_if_done (boolean)

If true, skip formation calculation if formation_energies.csv already exists in a composition folder.

normalize (string)

Defines how the reported formation energy is normalized:

  • "total" — total formation energy (eV)

  • "per_dopant" — eV per substituted dopant atom

  • "per_host" — eV per atom in the supercell

Formation energies require:

  • Successful execution of refs-build

  • reference_structures/reference_energies.json

  • Relaxed candidate structures in candidate_*/02_relax/

Notes

  • The same substitution formula is used for both reference schemes.

  • In metal mode, elemental metal references define the chemical potentials.

  • In oxide mode, chemical potentials are derived from oxide references and the chosen oxygen condition (O-rich or O-poor).

  • The selected reference mode is stored in candidate_*/04_formation/meta.json for reproducibility.


[database]

Step 07 — Final database collection.

skip_if_done (boolean, default: true)

If true, do not overwrite existing results_database.csv.
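
The last two stages need very little configuration. A sketch, with the normalize choice picked arbitrarily from the three documented options:

```toml
[formation]
skip_if_done = true
normalize = "per_dopant"    # or "total" / "per_host"

[database]
skip_if_done = true         # do not overwrite an existing results_database.csv
```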


Design Principles

  • All stages are deterministic given fixed input.

  • All randomness is seed-controlled.

  • Each stage can be skipped independently.

  • Each stage writes metadata for full reproducibility.

  • The relaxed host supercell is constructed once and reused.