7. Database Collection

Implementation

This stage is implemented in:

src/dopingflow/collect.py

The public entry point is:

run_collect(...)

Purpose

This stage consolidates results from previous workflow stages into a single flat CSV that can be used as a lightweight database for downstream analysis, plotting, and reporting.

The database is written to the workflow root as:

results_database.csv

Only filtered / selected candidates are included (strict policy), so the database represents the subset of structures that passed your selection criteria.

Inputs

This stage uses settings from two sections of input.toml:

  • [structure]: provides the workflow output directory containing composition folders.

  • [database]: controls skipping/overwriting of the database CSV.

It expects the standard workflow outputs inside each composition folder (if present):

Composition-level (folder-level)

  • metadata.json (from Step 01 generation)

  • selected_candidates.txt (from Step 04 filtering) preferred

  • ranking_relax_filtered.csv (from Step 04 filtering) fallback

  • ranking_scan.csv (from Step 02 scanning)

  • bandgap_alignn_summary.csv (from Step 05 bandgap prediction)

  • formation_energies.csv (from Step 06 formation energies)

Candidate-level

  • candidate_*/02_relax/meta.json (from Step 03 relaxation)

Selection Policy

This stage follows a strict selection rule: it will never include unfiltered candidates.

Selection priority within each composition folder:

  1. If selected_candidates.txt exists: use exactly those candidate names (one per line).

  2. Else, if ranking_relax_filtered.csv exists: include the candidates listed there.

  3. Else: the folder is skipped (no candidates are collected).

This ensures that the database represents only the candidates you explicitly kept after filtering.

Method Summary

For each composition folder inside [structure].outdir:

  1. Determine the candidate list using the selection policy above.

  2. Load composition metadata from metadata.json (if available).

  3. Load per-stage summary tables (if available):

    • scan ranking table

    • filtered ranking table

    • bandgap summary table

    • formation energy table

  4. For each selected candidate:

    1. Read relaxed-energy metadata from candidate_*/02_relax/meta.json (if available).

    2. Combine composition-level and candidate-level information into one row.

  5. Write all rows to:

results_database.csv

Database Schema

The output database contains the following columns:

Composition-level fields

  • composition_tag: Folder name of the composition (effective tag used by the generator).

  • requested_index: Index of the composition in the generator loop (if available).

  • requested_pct_json: JSON string of requested dopant percentages (if available).

  • effective_pct_json: JSON string of effective dopant percentages after rounding (if available).

  • rounded_counts_json: JSON string of rounded integer substitution counts (if available).

  • host_species: Host species used for substitution (from generation metadata).

  • n_host: Number of host sites in the supercell (from generation metadata).

  • supercell_json: JSON string of the supercell dimensions (if available).

Candidate-level identity

  • candidate: Candidate folder name (e.g., candidate_001).

  • candidate_path: Absolute path to the candidate folder (useful for linking back to files).

Relax (filtered) fields

  • rank_relax_filtered: Rank within the filtered set (if available).

  • E_relaxed_eV_filtered: Relaxed energy read from the filtered ranking table (if available).

Scan fields

  • rank_scan: Single-point rank from scanning stage (if available).

  • E_scan_eV: Single-point energy from scanning stage (if available).

Relax (meta) fields

  • E_relaxed_eV: Relaxed energy extracted from candidate_*/02_relax/meta.json (if available).

Bandgap fields

  • bandgap_eV: Predicted bandgap value from the bandgap summary (if available).

Formation-energy fields

  • E_form_eV_total: Total formation energy in eV (if available).

  • E_form_norm: Normalized formation energy (as written by Step 06, depends on your normalization mode).

  • n_dopant_atoms: Total dopant atoms used in the candidate (if available).

  • dopant_counts: Compact dopant-count string from Step 06 (if available).

Outputs

This stage writes one file in the workflow root:

results_database.csv

Each row corresponds to one selected candidate in one composition folder.

Reproducibility and Skipping

If:

[database].skip_if_done = true

and results_database.csv already exists, the stage is skipped.

Set skip_if_done = false to overwrite and regenerate the database.

Notes and Limitations

  • The database is intentionally flat and file-based; it is designed for quick use in pandas, spreadsheets, or plotting scripts.

  • Missing upstream files are handled gracefully: columns are left empty/None when a stage output is unavailable.

  • Only the filtered/selected candidate subset is included by design. If you later want a “full database” including all candidates, the selection policy can be relaxed as an optional mode.