7. Database Collection

Implementation

This stage is implemented in:

src/dopingflow/collect.py

The public entry point is:

run_collect(...)

Purpose

This stage consolidates results from previous workflow stages into a single flat CSV that can be used as a lightweight database for downstream analysis, plotting, and reporting.

The database is written to the workflow root as:

results_database.csv

Only filtered/selected candidates are included (strict policy), so the database contains only the structures that passed your selection criteria.

Inputs

This stage uses settings from two sections of input.toml:

[structure]: provides the workflow output directory containing composition folders.
[database]: controls skipping/overwriting of the database CSV.

It reads the following standard workflow outputs from each composition folder when they are present:

Composition-level (folder-level)

metadata.json (from Step 01 generation)
selected_candidates.txt (from Step 04 filtering; preferred selection source)
ranking_relax_filtered.csv (from Step 04 filtering; fallback selection source)
ranking_scan.csv (from Step 02 scanning)
bandgap_alignn_summary.csv (from Step 05 bandgap prediction)
formation_energies.csv (from Step 06 formation energies)

Candidate-level

candidate_*/02_relax/meta.json (from Step 03 relaxation)

Selection Policy

This stage follows a strict selection rule: it will never include unfiltered candidates.

Selection priority within each composition folder:

If selected_candidates.txt exists: use exactly those candidate names (one per line).
Else, if ranking_relax_filtered.csv exists: include the candidates listed there.
Else: the folder is skipped (no candidates are collected).

This ensures that the database represents only the candidates you explicitly kept after filtering.
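
The selection priority above can be sketched as follows. This is a minimal illustration, not the code in collect.py: the function name select_candidates is hypothetical, and the "candidate" column name in ranking_relax_filtered.csv is an assumption.

```python
import csv
from pathlib import Path
from typing import List


def select_candidates(comp_dir: Path) -> List[str]:
    """Return the candidate names to collect for one composition folder.

    Priority: selected_candidates.txt, then ranking_relax_filtered.csv,
    otherwise an empty list (the folder is skipped).
    """
    selected = comp_dir / "selected_candidates.txt"
    if selected.is_file():
        # One candidate name per line; blank lines are ignored.
        lines = selected.read_text(encoding="utf-8").splitlines()
        return [ln.strip() for ln in lines if ln.strip()]

    filtered = comp_dir / "ranking_relax_filtered.csv"
    if filtered.is_file():
        with filtered.open(newline="", encoding="utf-8") as fh:
            # ASSUMPTION: candidate names are stored in a "candidate" column.
            return [r["candidate"] for r in csv.DictReader(fh) if r.get("candidate")]

    # Strict policy: no filtered output means nothing is collected.
    return []
```

Returning an empty list for the "skip" case keeps the caller simple: a loop over the result does nothing for folders that were never filtered.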

Method Summary

For each composition folder inside [structure].outdir:

1. Determine the candidate list using the selection policy above.
2. Load composition metadata from metadata.json (if available).
3. Load per-stage summary tables (if available):
   - scan ranking table
   - filtered ranking table
   - bandgap summary table
   - formation energy table
4. For each selected candidate:
   - Read relaxed-energy metadata from candidate_*/02_relax/meta.json (if available).
   - Combine composition-level and candidate-level information into one row.

Finally, write the rows collected from all composition folders to:

results_database.csv
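
The per-folder procedure above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the actual implementation: the helper names (read_table, read_json, collect_composition, write_database) are hypothetical, only a subset of the schema is written, and the JSON keys and CSV column names read from upstream files are assumed to match the database column names.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List

# Subset of the database columns, used here for illustration only.
FIELDS = ["composition_tag", "host_species", "candidate",
          "E_scan_eV", "E_relaxed_eV", "bandgap_eV"]


def read_table(path: Path, key: str = "candidate") -> Dict[str, dict]:
    """Index a CSV by candidate name; a missing file yields an empty table."""
    if not path.is_file():
        return {}
    with path.open(newline="", encoding="utf-8") as fh:
        return {row[key]: row for row in csv.DictReader(fh) if row.get(key)}


def read_json(path: Path) -> dict:
    """Load a JSON file; a missing file yields an empty dict."""
    if not path.is_file():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def collect_composition(comp_dir: Path, candidates: List[str]) -> List[dict]:
    """Build one row per selected candidate in a composition folder."""
    meta = read_json(comp_dir / "metadata.json")
    scan = read_table(comp_dir / "ranking_scan.csv")
    gap = read_table(comp_dir / "bandgap_alignn_summary.csv")
    rows = []
    for name in candidates:
        relax = read_json(comp_dir / name / "02_relax" / "meta.json")
        rows.append({
            "composition_tag": comp_dir.name,
            "host_species": meta.get("host_species"),           # assumed key
            "candidate": name,
            "E_scan_eV": scan.get(name, {}).get("E_scan_eV"),   # assumed column
            "E_relaxed_eV": relax.get("E_relaxed_eV"),          # assumed key
            "bandgap_eV": gap.get(name, {}).get("bandgap_eV"),  # assumed column
        })
    return rows


def write_database(rows: List[dict], out_csv: Path) -> None:
    """Write rows to a flat CSV; None values are written as empty cells."""
    with out_csv.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

Because every reader returns an empty container when its file is missing, a candidate with no bandgap or scan output still produces a row, with the corresponding cells left empty.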

Database Schema

The output database contains the following columns:

Composition-level fields

composition_tag: Folder name of the composition (effective tag used by the generator).
requested_index: Index of the composition in the generator loop (if available).
requested_pct_json: JSON string of requested dopant percentages (if available).
effective_pct_json: JSON string of effective dopant percentages after rounding (if available).
rounded_counts_json: JSON string of rounded integer substitution counts (if available).
host_species: Host species used for substitution (from generation metadata).
n_host: Number of host sites in the supercell (from generation metadata).
supercell_json: JSON string of the supercell dimensions (if available).

Candidate-level identity

candidate: Candidate folder name (e.g., candidate_001).
candidate_path: Absolute path to the candidate folder (useful for linking back to files).

Relax (filtered) fields

rank_relax_filtered: Rank within the filtered set (if available).
E_relaxed_eV_filtered: Relaxed energy read from the filtered ranking table (if available).

Scan fields

rank_scan: Single-point rank from the scanning stage (if available).
E_scan_eV: Single-point energy from the scanning stage (if available).

Relax (meta) fields

E_relaxed_eV: Relaxed energy extracted from candidate_*/02_relax/meta.json (if available).

Bandgap fields

bandgap_eV: Predicted bandgap value from the bandgap summary (if available).

Formation-energy fields

E_form_eV_total: Total formation energy in eV (if available).
E_form_norm: Normalized formation energy as written by Step 06; its meaning depends on your normalization mode (if available).
n_dopant_atoms: Total dopant atoms used in the candidate (if available).
dopant_counts: Compact dopant-count string from Step 06 (if available).
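
As an illustration of how the schema above might be consumed, the sketch below reads the database with the standard csv module, picks the candidate with the lowest E_form_norm in each composition, and decodes one of the JSON string columns. The function names are hypothetical, and treating both empty cells and the literal text "None" as missing values is an assumption about how unavailable fields are serialized.

```python
import csv
import json
from typing import Dict, Iterable, List, Optional


def load_rows(path: str) -> List[dict]:
    """Read results_database.csv into a list of dicts (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def to_float(text: Optional[str]) -> Optional[float]:
    """Convert a CSV cell to float; empty or "None" cells become None."""
    if text is None or text.strip() in ("", "None"):
        return None
    return float(text)


def best_by_formation_energy(rows: Iterable[dict]) -> Dict[str, dict]:
    """Return, per composition_tag, the row with the lowest E_form_norm."""
    best: Dict[str, dict] = {}
    for row in rows:
        value = to_float(row.get("E_form_norm"))
        if value is None:
            continue  # no formation energy available for this candidate
        tag = row["composition_tag"]
        if tag not in best or value < to_float(best[tag]["E_form_norm"]):
            best[tag] = row
    return best


def requested_pct(row: dict) -> dict:
    """Decode the requested_pct_json column; empty cells give {}."""
    raw = row.get("requested_pct_json") or ""
    return json.loads(raw) if raw.strip() else {}
```

A typical call would be best_by_formation_energy(load_rows("results_database.csv")), after which candidate_path in each returned row links back to the files of the selected structure.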

Outputs

This stage writes one file in the workflow root:

results_database.csv

Each row corresponds to one selected candidate in one composition folder.

Reproducibility and Skipping

If:

[database].skip_if_done = true

and results_database.csv already exists, the stage is skipped.

Set skip_if_done = false to overwrite and regenerate the database.
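
The skip rule can be expressed as a small predicate. This is a sketch only: should_skip is a hypothetical name, the configuration is assumed to be available as a nested dict parsed from input.toml, and treating a missing skip_if_done key as false is an assumption rather than the documented default.

```python
from pathlib import Path


def should_skip(config: dict, workflow_root: Path) -> bool:
    """Return True when the existing database should be left untouched."""
    # ASSUMPTION: a missing key is treated as False; the real default may differ.
    skip_if_done = bool(config.get("database", {}).get("skip_if_done", False))
    # Skip only if the flag is set AND the database file already exists.
    return skip_if_done and (workflow_root / "results_database.csv").is_file()
```

With this rule, deleting results_database.csv forces regeneration even when skip_if_done = true.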

Notes and Limitations

The database is intentionally flat and file-based; it is designed for quick use in pandas, spreadsheets, or plotting scripts.
Missing upstream files are handled gracefully: columns are left empty/None when a stage output is unavailable.
Only the filtered/selected candidate subset is included by design. A “full database” that includes all candidates would require relaxing the selection policy, for example as an optional mode.