2. Symmetry-Reduced Energy Pre-screening (Scanning)

Implementation

This stage is implemented in:

src/dopingflow/scan.py

The public entry point is:

run_scan(...)

Purpose

This stage performs an efficient energy pre-screening of dopant arrangements by enumerating symmetry-unique configurations on a selected sublattice and evaluating their single-point energies using a machine-learned interatomic potential.

For each generated structure folder, the method:

infers the substitutional sublattice and dopant counts from the input structure
generates dopant configurations either by exact enumeration or random sampling
removes symmetry-equivalent configurations
predicts a single-point energy for each configuration using the selected machine-learning backend
keeps the top-k lowest-energy candidates for downstream relaxation

Inputs

This stage uses settings from the following sections of input.toml:

[structure]: provides the output directory containing generated structures.
[doping]: provides the host species definition.
[generate]: provides the species ordering used when writing POSCAR files.
[scan]: controls enumeration limits, symmetry tolerance, parallelism, and selection.

Method Summary

For each structure subdirectory in [structure].outdir:

Read the structure file specified by [scan].poscar_in.
Identify the enumeration sublattice (all non-anion sites) and infer:
- host species count on the sublattice
- dopant species counts on the sublattice
Estimate the total number of raw (non-symmetry-reduced) configurations.
Construct a parent structure and compute symmetry operations acting on the sublattice.
Decide the scan strategy according to [scan].mode:
- exact: perform full symmetry-unique enumeration
- sample: generate configurations using random symmetry-unique sampling
- auto: use enumeration when feasible, otherwise switch to sampling
Generate configurations:
- enumeration: iterate over all distinct dopant arrangements on the sublattice
- sampling: randomly generate dopant arrangements and filter duplicates
Reduce configurations to symmetry-unique representations.
Evaluate single-point energies in parallel using the selected machine-learning backend.
Select the [scan].topk lowest-energy candidates and write them to candidate_*/01_scan.
Write a CSV ranking file and a human-readable summary.

Backend and Model Selection

The scan stage supports multiple machine-learning backends for energy evaluation.

The backend is selected via [scan].backend and controls how energies are computed.

Supported backends include:

m3gnet:
- TensorFlow-based universal interatomic potential
- robust and general-purpose
uma:
- FAIR-Chem universal models (via Hugging Face)
- requires authentication and model access permissions
mace:
- foundation models trained on Materials Project / OMAT / MPA datasets
- fast and efficient, especially for large-scale screening
grace:
- graph neural network models for atomistic simulations (if installed)
- experimental integration

Model selection is controlled via [scan].model.

Examples:

backend = "mace"
model = "small"

backend = "uma"
model = "uma-s-1p2"
task  = "omat"

Notes:

task is required only for the UMA backend.
For other backends, task is ignored.
Models are typically downloaded and cached automatically on first use.
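
The rules in the notes above can be illustrated with a small validation helper. This is a minimal sketch, not the code in scan.py: the function name normalize_backend_settings and the dictionary layout are assumptions made for illustration. Only the backend names and the rule that task applies solely to uma come from this page.

```python
SUPPORTED_BACKENDS = {"m3gnet", "uma", "mace", "grace"}


def normalize_backend_settings(scan_cfg: dict) -> dict:
    """Validate the backend-related keys of a [scan] table (hypothetical helper)."""
    backend = scan_cfg.get("backend")
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")

    settings = {"backend": backend, "model": scan_cfg.get("model")}
    if backend == "uma":
        # task is required only for the UMA backend
        if "task" not in scan_cfg:
            raise ValueError("[scan].task is required when backend = 'uma'")
        settings["task"] = scan_cfg["task"]
    # For all other backends a task entry is ignored (dropped here).
    return settings


print(normalize_backend_settings({"backend": "mace", "model": "small"}))
print(normalize_backend_settings({"backend": "uma", "model": "uma-s-1p2", "task": "omat"}))
```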

Backend	Framework	GPU	Strength	Notes
m3gnet	TensorFlow	Yes	Stable, general	Slower on CPU
uma	PyTorch	Yes	High accuracy	Needs HF access
mace	PyTorch	Yes	Fast, scalable	Best for scan
grace	PyTorch	Yes	Advanced GNN	Experimental

Enumeration Sublattice Definition

The enumerated sublattice is inferred from the input structure using the rule:

anion sites are excluded (defined by [scan].anion_species)
all remaining sites form the enumeration sublattice

This design makes the method general across crystals where doping occurs on a cation (or non-anion) sublattice.

The dopant counts are inferred directly from the input POSCAR by counting species on the enumeration sublattice. The host species is provided by [doping].host_species.
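
The inference rule can be sketched in a few lines of plain Python. The helper name infer_sublattice is hypothetical, and the sketch assumes that the structure has already been reduced to a list of per-site element symbols and that [scan].anion_species is a list of symbols. The example composition is invented purely for illustration.

```python
from collections import Counter


def infer_sublattice(species, anion_species, host_species):
    """Split sites into anions and the enumeration sublattice (hypothetical helper).

    species: per-site element symbols, in file order.
    Returns (sublattice site indices, host count, {dopant: count}).
    """
    anions = set(anion_species)
    # Every site that is not an anion belongs to the enumeration sublattice.
    sub_idx = [i for i, s in enumerate(species) if s not in anions]
    counts = Counter(species[i] for i in sub_idx)
    # Whatever is neither anion nor host is treated as a dopant.
    host_count = counts.pop(host_species, 0)
    return sub_idx, host_count, dict(counts)


# Illustrative 24-site cell: 8 cation sites (6 host, 2 dopants) and 16 anions.
species = ["Sn"] * 6 + ["Sb", "Ti"] + ["O"] * 16
sub_idx, n_host, dopants = infer_sublattice(species, ["O"], "Sn")
print(len(sub_idx), n_host, dopants)  # 8 6 {'Sb': 1, 'Ti': 1}
```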

Combinatorics and Safety Limits

Raw configuration count

Given a sublattice with \(N\) sites and fixed dopant counts, the number of raw configurations (before symmetry reduction) grows combinatorially.
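
For fixed counts, the raw number of arrangements is the multinomial coefficient \(N! / (n_0! \prod_i n_i!)\), where \(n_0\) is the host count and \(n_i\) are the dopant counts. The sketch below evaluates it as a product of binomial coefficients. The function name raw_configuration_count is an assumption for illustration and is not taken from scan.py.

```python
from math import comb


def raw_configuration_count(n_sites, dopant_counts):
    """Number of distinct dopant arrangements before symmetry reduction.

    Places each dopant species in turn on the sites that are still free,
    which multiplies out to the multinomial coefficient.
    """
    if sum(dopant_counts.values()) > n_sites:
        raise ValueError("more dopant atoms than sublattice sites")
    remaining = n_sites
    total = 1
    for n in dopant_counts.values():
        total *= comb(remaining, n)
        remaining -= n
    return total


print(raw_configuration_count(8, {"Sb": 1, "Ti": 1}))   # 56
print(raw_configuration_count(32, {"Sb": 4, "Ti": 4}))  # 736281000
```

The second call shows how quickly the count grows: quadrupling the sublattice and the dopant counts takes the raw count from 56 to more than 700 million.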

To avoid runaway enumeration, the workflow:

estimates the raw count and enforces [scan].max_enum
enforces a hard limit on symmetry-unique configurations via [scan].max_unique

If mode = "auto" and the estimated configuration count exceeds [scan].max_enum, the scan automatically switches to sampling mode.

If mode = "exact", exceeding these limits stops the scan to prevent infeasible enumeration.

This protects against infeasible compositions and/or large supercells.
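
The mode handling described in this subsection can be summarized as a small decision function. This is a sketch under assumptions: the name choose_scan_strategy, the returned strings, and the exception type are illustrative and may differ from the actual implementation.

```python
def choose_scan_strategy(mode, raw_count, max_enum):
    """Return "enumerate" or "sample" for a given [scan].mode (hypothetical helper)."""
    if mode == "sample":
        return "sample"
    if mode == "exact":
        if raw_count > max_enum:
            # exact mode refuses to run rather than silently falling back
            raise RuntimeError(
                f"raw count {raw_count} exceeds max_enum {max_enum}; "
                "use mode = 'auto' or 'sample'"
            )
        return "enumerate"
    if mode == "auto":
        # auto enumerates when feasible and otherwise switches to sampling
        return "enumerate" if raw_count <= max_enum else "sample"
    raise ValueError(f"unknown scan mode: {mode!r}")


print(choose_scan_strategy("auto", 56, 100_000))           # enumerate
print(choose_scan_strategy("auto", 736_281_000, 100_000))  # sample
```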

Maximum number of dopant species per scan

The current symmetry-unique enumerator is implemented explicitly for up to three distinct dopant species on the enumerated sublattice.

If the inferred dopant set contains more than three species, the scan raises an error.

This limit applies to the scanning enumeration procedure (not necessarily to other stages).

Symmetry Reduction

To avoid evaluating symmetry-equivalent configurations, the scan:

Constructs a parent structure where all sites of the enumeration sublattice are assigned to the host species.
Uses a space-group analysis (with tolerance [scan].symprec) to obtain symmetry operations.
Converts symmetry operations into permutations of sublattice indices.
For each dopant labeling, computes a canonical representation under these permutations.
Keeps only labelings with unique canonical keys.

This yields the set of symmetry-unique dopant configurations.
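
The canonical-key idea can be illustrated without any crystallography library. The sketch below assumes the symmetry operations have already been converted to index permutations and uses the lexicographically smallest image of a labeling as its key. Whether scan.py uses this particular canonical form is not stated here, so treat it as one possible choice. The toy "sublattice" is a 4-site ring whose only symmetries are rotations.

```python
from itertools import combinations


def canonical_key(labeling, permutations):
    """Smallest image of the labeling under all site permutations."""
    return min(tuple(labeling[i] for i in p) for p in permutations)


def unique_labelings(n_sites, n_dopant, permutations):
    """Enumerate placements of one dopant species and keep one per symmetry class."""
    representatives = {}
    for sites in combinations(range(n_sites), n_dopant):
        labeling = tuple(1 if i in sites else 0 for i in range(n_sites))
        key = canonical_key(labeling, permutations)
        # Keep the first labeling seen for each canonical key.
        representatives.setdefault(key, labeling)
    return list(representatives.values())


# Toy symmetry group: the four rotations of a 4-site ring.
ring_rotations = [tuple((i + k) % 4 for i in range(4)) for k in range(4)]
print(unique_labelings(4, 2, ring_rotations))
# [(1, 1, 0, 0), (1, 0, 1, 0)]
```

Of the six raw placements of two dopants on the ring, only two survive: dopants on adjacent sites and dopants on opposite sites. Taking the minimum gives a well-defined key only if the permutation list forms a complete group, which is the case for the operations returned by a space-group analysis.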

Sampling Mode

When mode = "sample" (or when mode = "auto" switches to sampling), the workflow generates configurations by random sampling instead of full combinatorial enumeration.

The algorithm:

randomly assigns dopants to sublattice sites
converts the labeling to a canonical symmetry representation
discards duplicates already encountered
accumulates unique configurations until a batch is formed
evaluates the batch using the selected machine-learning backend

Sampling is controlled by the parameters:

sample_budget: maximum number of sampling attempts
sample_batch_size: number of unique structures evaluated per batch
sample_patience: early stopping when no better structures are found
sample_seed: random seed for reproducibility
sample_max_saved: maximum stored canonical keys to avoid duplicates

Sampling allows the scan to explore large configuration spaces that would otherwise be infeasible to enumerate.
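
The sampling loop can be sketched as follows. Several details are assumptions made for illustration: patience is counted in consecutive batches without a new best energy, only one dopant species is placed, energy_fn stands in for the ML backend, and the sample_max_saved cap is omitted for brevity. The actual semantics in scan.py may differ.

```python
import random


def canonical_key(labeling, permutations):
    """Smallest image of the labeling under all site permutations."""
    return min(tuple(labeling[i] for i in p) for p in permutations)


def sample_scan(n_sites, n_dopant, permutations, energy_fn, *,
                budget, batch_size, patience, seed):
    """Random symmetry-unique sampling; returns sorted (energy, key) pairs."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    seen, results, batch = set(), [], []
    best, stale = float("inf"), 0

    def evaluate_batch():
        nonlocal best, stale
        energies = [energy_fn(key) for key in batch]
        results.extend(zip(energies, batch))
        if min(energies) < best:
            best, stale = min(energies), 0
        else:
            stale += 1  # this batch found nothing better
        batch.clear()

    for _ in range(budget):
        sites = set(rng.sample(range(n_sites), n_dopant))
        labeling = tuple(1 if i in sites else 0 for i in range(n_sites))
        key = canonical_key(labeling, permutations)
        if key in seen:
            continue  # symmetry-equivalent to an earlier sample
        seen.add(key)
        batch.append(key)
        if len(batch) == batch_size:
            evaluate_batch()
            if stale >= patience:
                break
    if batch:
        evaluate_batch()  # evaluate the final partial batch
    return sorted(results)


# Toy usage: 6-site ring, rotations only, and a made-up energy function.
rotations = [tuple((i + k) % 6 for i in range(6)) for k in range(6)]
toy_energy = lambda key: float(sum(i for i, x in enumerate(key) if x))
ranked = sample_scan(6, 2, rotations, toy_energy,
                     budget=200, batch_size=2, patience=5, seed=0)
print(ranked)
```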

Energy Evaluation and Parallel Execution

Each symmetry-unique configuration is evaluated using a single-point energy prediction with the selected ML backend.

Key points:

structures are not relaxed at this stage
only energies are predicted (fast screening)
predictions are backend-dependent

Backend-specific behavior:

m3gnet:
- TensorFlow-based
- GPU mode uses a single worker for stability
uma:
- PyTorch-based
- supports CPU and GPU execution
mace:
- PyTorch-based foundation model
- typically executed efficiently in single-process mode
- model is loaded once and reused across all configurations
grace:
- PyTorch-based (if available)
- performance depends on model implementation

Parallel execution:

CPU mode uses multiprocessing (n_workers)
GPU mode may restrict parallelism depending on backend
worker processes are created using the spawn method for robustness
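
A minimal sketch of this execution pattern using the standard library is shown below. The names evaluate_all and toy_energy are hypothetical, and collapsing every GPU run to a single process is a simplification of the backend-dependent behavior described above.

```python
import multiprocessing as mp


def toy_energy(labeling):
    """Stand-in for a backend single-point prediction (hypothetical)."""
    return float(sum(i * x for i, x in enumerate(labeling)))


def evaluate_all(labelings, energy_fn, n_workers=1, use_gpu=False):
    """Evaluate energies serially or in a spawn-based worker pool."""
    if use_gpu or n_workers <= 1:
        # Single process: a model would be loaded once and reused.
        return [energy_fn(lab) for lab in labelings]
    # "spawn" starts fresh interpreters instead of forking the parent,
    # so energy_fn must be a picklable, module-level function.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=n_workers) as pool:
        # pool.map returns results in input order, whichever worker finishes first.
        return pool.map(energy_fn, labelings)


print(evaluate_all([(1, 0, 0), (0, 0, 1)], toy_energy, n_workers=1))
# [0.0, 2.0]
```

With n_workers greater than 1, the call must run from an importable script under an `if __name__ == "__main__":` guard, since spawn re-imports the main module in each worker.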

Selection of Top-k Candidates

The workflow keeps the [scan].topk configurations with the lowest predicted single-point energies and writes each to:

<structure_dir>/candidate_###/01_scan/POSCAR
<structure_dir>/candidate_###/01_scan/meta.json

A CSV ranking file is written to the structure directory:

<structure_dir>/ranking_scan.csv

The ranking CSV contains:

candidate name
rank (single-point)
predicted single-point energy
a short dopant-position signature
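
The selection and ranking step can be sketched with the standard library. The column names, the candidate numbering starting at 001, and the signature strings are assumptions for illustration, so the real ranking_scan.csv may use different headers and formats.

```python
import csv
import heapq
import io


def write_ranking(results, topk, stream):
    """Write the topk lowest-energy entries as CSV rows (hypothetical layout).

    results: iterable of (energy, signature) pairs.
    """
    # nsmallest orders by energy first, then by signature to break ties.
    best = heapq.nsmallest(topk, results)
    writer = csv.writer(stream)
    writer.writerow(["candidate", "rank_singlepoint", "energy_singlepoint", "signature"])
    for rank, (energy, signature) in enumerate(best, start=1):
        writer.writerow([f"candidate_{rank:03d}", rank, f"{energy:.6f}", signature])
    return best


# Made-up energies and signatures, written to an in-memory buffer.
results = [(-10.2, "Sb@3"), (-10.5, "Sb@0"), (-9.9, "Sb@5")]
buffer = io.StringIO()
write_ranking(results, topk=2, stream=buffer)
print(buffer.getvalue())
```

Because the ranking depends only on the (energy, signature) pairs, the order in which results arrive from parallel workers does not change the output.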

Reproducibility

For a fixed input structure and scan settings:

enumeration mode produces deterministic results
sampling mode is stochastic but reproducible when sample_seed is fixed

In both modes, the results are determined by:

inferred dopant counts
symmetry analysis (controlled by symprec)
enumeration ordering
the ML model used for prediction

Parallel evaluation does not affect the final ranking, since candidates are ranked by their predicted energies rather than by the order in which workers finish.

Outputs

For each processed structure folder:

candidate_*/01_scan/POSCAR: POSCAR files of selected low-energy candidates
candidate_*/01_scan/meta.json: metadata describing scan settings and counts
ranking_scan.csv: ranked list of candidates with predicted energies
scan_summary.txt: human-readable summary of the scan

Notes and Limitations

This stage performs single-point ML energy screening only; it does not relax structures.
Symmetry reduction depends on the tolerance symprec: values that are too large may merge distinct configurations, while values that are too small may miss symmetry operations and leave equivalent configurations unmerged.
Enumeration can become infeasible for large sublattices and/or high dopant counts; limits max_enum and max_unique are enforced to prevent runaway runtime/memory.
The current label enumerator supports up to three distinct dopant species.
Predicted energies are surrogate ML values and should be interpreted as a ranking heuristic rather than absolute thermodynamic energies.
Different backends provide different accuracy/speed trade-offs:
- m3gnet: robust general-purpose model
- uma: high-quality FAIR-Chem models (requires access permissions)
- mace: fast foundation models suitable for large-scale screening
- grace: experimental GNN-based models
Sampling mode does not guarantee discovery of the global minimum but efficiently identifies low-energy candidates for subsequent relaxation.