7. Database Collection
Implementation
This stage is implemented in:
src/dopingflow/collect.py
The public entry point is:
run_collect(...)
Purpose
This stage consolidates results from previous workflow stages into a single flat CSV that can be used as a lightweight database for downstream analysis, plotting, and reporting.
The database is written to the workflow root as:
results_database.csv
Only filtered / selected candidates are included (strict policy), so the database represents the subset of structures that passed your selection criteria.
Inputs
This stage uses settings from two sections of input.toml:
[structure]: provides the workflow output directory containing composition folders.[database]: controls skipping/overwriting of the database CSV.
It expects the standard workflow outputs inside each composition folder (if present):
Composition-level (folder-level)
metadata.json(from Step 01 generation)selected_candidates.txt(from Step 04 filtering) preferredranking_relax_filtered.csv(from Step 04 filtering) fallbackranking_scan.csv(from Step 02 scanning)bandgap_alignn_summary.csv(from Step 05 bandgap prediction)formation_energies.csv(from Step 06 formation energies)
Candidate-level
candidate_*/02_relax/meta.json(from Step 03 relaxation)
Selection Policy
This stage follows a strict selection rule: it will never include unfiltered candidates.
Selection priority within each composition folder:
If
selected_candidates.txtexists: use exactly those candidate names (one per line).Else, if
ranking_relax_filtered.csvexists: include the candidates listed there.Else: the folder is skipped (no candidates are collected).
This ensures that the database represents only the candidates you explicitly kept after filtering.
Method Summary
For each composition folder inside [structure].outdir:
Determine the candidate list using the selection policy above.
Load composition metadata from
metadata.json(if available).Load per-stage summary tables (if available):
scan ranking table
filtered ranking table
bandgap summary table
formation energy table
For each selected candidate:
Read relaxed-energy metadata from
candidate_*/02_relax/meta.json(if available).Combine composition-level and candidate-level information into one row.
Write all rows to:
results_database.csv
Database Schema
The output database contains the following columns:
Composition-level fields
composition_tag: Folder name of the composition (effective tag used by the generator).requested_index: Index of the composition in the generator loop (if available).requested_pct_json: JSON string of requested dopant percentages (if available).effective_pct_json: JSON string of effective dopant percentages after rounding (if available).rounded_counts_json: JSON string of rounded integer substitution counts (if available).host_species: Host species used for substitution (from generation metadata).n_host: Number of host sites in the supercell (from generation metadata).supercell_json: JSON string of the supercell dimensions (if available).
Candidate-level identity
candidate: Candidate folder name (e.g.,candidate_001).candidate_path: Absolute path to the candidate folder (useful for linking back to files).
Relax (filtered) fields
rank_relax_filtered: Rank within the filtered set (if available).E_relaxed_eV_filtered: Relaxed energy read from the filtered ranking table (if available).
Scan fields
rank_scan: Single-point rank from scanning stage (if available).E_scan_eV: Single-point energy from scanning stage (if available).
Relax (meta) fields
E_relaxed_eV: Relaxed energy extracted fromcandidate_*/02_relax/meta.json(if available).
Bandgap fields
bandgap_eV: Predicted bandgap value from the bandgap summary (if available).
Formation-energy fields
E_form_eV_total: Total formation energy in eV (if available).E_form_norm: Normalized formation energy (as written by Step 06, depends on your normalization mode).n_dopant_atoms: Total dopant atoms used in the candidate (if available).dopant_counts: Compact dopant-count string from Step 06 (if available).
Outputs
This stage writes one file in the workflow root:
results_database.csv
Each row corresponds to one selected candidate in one composition folder.
Reproducibility and Skipping
If:
[database].skip_if_done = true
and results_database.csv already exists, the stage is skipped.
Set skip_if_done = false to overwrite and regenerate the database.
Notes and Limitations
The database is intentionally flat and file-based; it is designed for quick use in pandas, spreadsheets, or plotting scripts.
Missing upstream files are handled gracefully: columns are left empty/
Nonewhen a stage output is unavailable.Only the filtered/selected candidate subset is included by design. If you later want a “full database” including all candidates, the selection policy can be relaxed as an optional mode.