Workflow Overview

Conceptual Pipeline

The ML Doping Workflow implements a fully automated, multi-stage surrogate pipeline for the exploration of doped crystalline materials.

It combines symmetry-aware structure generation with machine-learned interatomic potentials to efficiently screen large configurational spaces.

The workflow is designed to first identify promising bulk candidates and then optionally extend the analysis to surface structures.

Pipeline Structure

Reference Construction → Enumeration → Screening → Relaxation → Filtering → Band Gap → Formation Energy → Database

Optional Post-Processing:

Database → Surface Generation → Surface Relaxation

Stages

  1. Reference construction and relaxation
     - Relax host structure (unit cell and supercell)
     - Relax reference phases (metal or oxide mode)
     - Build thermodynamic reference dataset

  2. Symmetry-reduced dopant enumeration
     - Generate substitutional doped configurations
     - Identify symmetry-unique arrangements on the cation sublattice

  3. ML-based energy screening
     - Evaluate single-point energies using a selected ML backend
     - Supports: M3GNet, UMA, MACE, GRACE
     - Exact enumeration or stochastic sampling

  4. Structure relaxation
     - Relax candidate structures using ML forces
     - Uses ASE optimizers (e.g. BFGS, FIRE, LBFGS)
     - CPU or GPU execution

  5. Energy-based filtering
     - Select low-energy candidates
     - Window-based or top-N selection strategies

  6. Band gap prediction
     - Predict electronic band gaps using ALIGNN

  7. Formation energy evaluation
     - Compute formation energies using reference structures
     - Supports metal and oxide reference schemes

  8. Database assembly
     - Aggregate results across all stages
     - Export a unified CSV database

  9. Surface generation (optional)
     - Select candidates from the database
     - Generate slab structures for chosen Miller indices
     - Enumerate surface terminations
     - Optionally fix atoms in the slab

  10. Surface relaxation (optional)
      - Relax slab structures using ML interatomic potentials
      - Apply atom constraints (e.g. fixed bottom layers)
      - Use the same backend abstraction as bulk relaxation
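
The two selection strategies named in Stage 5 can be illustrated with a minimal sketch. This is not the workflow's own code: the function name `filter_candidates`, its parameters, the candidate labels, and the energy values are hypothetical, and the energies are assumed to share one consistent unit (for example eV per atom).

```python
from typing import Dict, List, Optional


def filter_candidates(
    energies: Dict[str, float],
    window: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[str]:
    """Return candidate labels ordered by energy, lowest first.

    window: keep candidates whose energy lies within `window` of the minimum.
    top_n:  keep at most the `top_n` lowest-energy candidates.
    Both criteria may be combined; the window is applied first.
    """
    if not energies:
        return []
    # Rank all candidates from lowest to highest energy.
    ranked = sorted(energies, key=energies.get)
    if window is not None:
        e_min = energies[ranked[0]]
        ranked = [cid for cid in ranked if energies[cid] - e_min <= window]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked


# Hypothetical energies for four configurations (illustrative values only).
energies = {"cfg_a": -10.00, "cfg_b": -9.95, "cfg_c": -9.70, "cfg_d": -9.99}

window_selected = filter_candidates(energies, window=0.10)
top_selected = filter_candidates(energies, top_n=2)
print(window_selected)  # ['cfg_a', 'cfg_d', 'cfg_b']
print(top_selected)     # ['cfg_a', 'cfg_d']
```

A window keeps every candidate close to the minimum, so the number of survivors depends on the energy landscape, whereas top-N fixes the number of survivors regardless of how far apart the energies are.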

Design Principles

  • Modular: Each stage can be executed independently

  • Backend-agnostic: Multiple ML potentials are supported

  • Reproducible: Fully controlled via input.toml

  • Scalable: Supports multiprocessing and GPU execution

  • Extensible: New models and stages can be added easily
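
One common way to obtain the backend-agnostic and extensible behaviour listed above is a small name-to-function registry. The sketch below only illustrates that pattern and makes no claim about how the workflow is implemented: `register_backend`, `get_backend`, and the `toy` backend are hypothetical names, and the toy energy function is a placeholder rather than a real ML potential.

```python
from typing import Callable, Dict

# Hypothetical registry: backend name -> function returning an energy.
EnergyFunction = Callable[[dict], float]
_BACKENDS: Dict[str, EnergyFunction] = {}


def register_backend(name: str) -> Callable[[EnergyFunction], EnergyFunction]:
    """Decorator that registers an energy function under a backend name."""
    def decorator(func: EnergyFunction) -> EnergyFunction:
        _BACKENDS[name.lower()] = func
        return func
    return decorator


def get_backend(name: str) -> EnergyFunction:
    """Look up a backend by case-insensitive name."""
    try:
        return _BACKENDS[name.lower()]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS)) or "none"
        raise ValueError(f"Unknown backend '{name}' (registered: {known})") from None


@register_backend("toy")
def toy_energy(structure: dict) -> float:
    # Placeholder model: -1.0 per atom. A real backend would call an ML potential.
    return -1.0 * len(structure["symbols"])


# Calling code depends only on the backend name, not on the model behind it.
energy = get_backend("toy")({"symbols": ["Ti", "O", "O"]})
print(energy)
```

With this pattern, adding a new model amounts to registering one more function; code that selects a backend by name stays unchanged.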

Notes

  • The core workflow (Stages 1–8) focuses on bulk screening and database generation.

  • Surface generation is intentionally decoupled from the main pipeline and is executed separately.

  • This design allows users to:
     - inspect and validate bulk candidates before surface modeling
     - control the number of generated slabs
     - avoid combinatorial explosion of surface structures
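
A rough count shows why limiting the candidates before surface generation matters. The sketch assumes, for simplicity, that every Miller index yields the same number of terminations; the function name `count_slabs` and all numbers are hypothetical and serve only as an illustration.

```python
def count_slabs(n_candidates: int, n_miller_indices: int, terminations_per_index: int) -> int:
    """Number of slabs if every candidate is cut along every Miller index
    and every termination is kept (simplifying assumption: equal
    termination count for each index)."""
    return n_candidates * n_miller_indices * terminations_per_index


# Hypothetical numbers: all relaxed bulk configurations vs. a hand-picked subset.
all_bulk = count_slabs(n_candidates=500, n_miller_indices=3, terminations_per_index=4)
selected = count_slabs(n_candidates=10, n_miller_indices=3, terminations_per_index=4)
print(all_bulk, selected)  # 6000 120
```

Because the slab count grows as a product of the three factors, reducing the number of bulk candidates first lowers the number of slab relaxations proportionally.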

Typical Usage

A typical workflow consists of:

  1. Running the full bulk pipeline:

    dopingflow run-all -c input.toml
    
  2. Inspecting the resulting database:

    results_database.csv
    
  3. Generating and optionally relaxing surfaces:

    dopingflow surface -c input.toml
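
The same three steps can be scripted from Python. The sketch below is an assumption-laden convenience wrapper, not part of the tool: it uses only the two command lines and the file name shown above, the helper names `build_command`, `summarize_database`, `run_bulk`, and `run_surfaces` are hypothetical, and no column names of the CSV are assumed.

```python
import csv
import subprocess
from typing import List, Tuple

CONFIG = "input.toml"
DATABASE = "results_database.csv"


def build_command(subcommand: str, config: str = CONFIG) -> List[str]:
    """Build the argument list for `dopingflow <subcommand> -c <config>`."""
    return ["dopingflow", subcommand, "-c", config]


def summarize_database(path: str = DATABASE) -> Tuple[List[str], int]:
    """Return the CSV column names and the number of data rows."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), len(rows)


def run_bulk() -> None:
    """Step 1 and 2: run the bulk pipeline, then print a database summary."""
    subprocess.run(build_command("run-all"), check=True)
    columns, n_rows = summarize_database()
    print(f"{n_rows} rows, columns: {columns}")


def run_surfaces() -> None:
    """Step 3: generate (and optionally relax) surfaces, after inspection."""
    subprocess.run(build_command("surface"), check=True)
```

Calling `run_bulk()` and `run_surfaces()` separately keeps the manual inspection step between the bulk pipeline and surface generation; `check=True` makes a failed command raise `subprocess.CalledProcessError` instead of continuing silently.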