7. Database Collection
======================

Implementation
--------------

This stage is implemented in:

::

   src/dopingflow/collect.py

The public entry point is:

::

   run_collect(...)


Purpose
-------

This stage consolidates results from previous workflow stages into a **single flat CSV**
that can be used as a lightweight database for downstream analysis, plotting, and reporting.

The database is written to the workflow root as:

::

   results_database.csv

Only **filtered / selected candidates** are included (strict policy), so the database
represents the subset of structures that passed your selection criteria.


Inputs
------

This stage uses settings from two sections of ``input.toml``:

- ``[structure]``: provides the workflow output directory containing composition folders.
- ``[database]``: controls skipping/overwriting of the database CSV.

It expects the standard workflow outputs inside each composition folder (if present):

Composition-level (folder-level)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- ``metadata.json`` (from Step 01 generation)
- ``selected_candidates.txt`` (from Step 04 filtering) **preferred**
- ``ranking_relax_filtered.csv`` (from Step 04 filtering) **fallback**
- ``ranking_scan.csv`` (from Step 02 scanning)
- ``bandgap_alignn_summary.csv`` (from Step 05 bandgap prediction)
- ``formation_energies.csv`` (from Step 06 formation energies)

Candidate-level
~~~~~~~~~~~~~~~

- ``candidate_*/02_relax/meta.json`` (from Step 03 relaxation)


Selection Policy
----------------

This stage follows a **strict selection rule**: it will never include unfiltered candidates.

Selection priority within each composition folder:

1. If ``selected_candidates.txt`` exists:
   use exactly those candidate names (one per line).

2. Else, if ``ranking_relax_filtered.csv`` exists:
   include the candidates listed there.

3. Else:
   the folder is skipped (no candidates are collected).

This ensures that the database represents only the candidates you explicitly kept
after filtering.


Method Summary
--------------

For each composition folder inside ``[structure].outdir``:

1. Determine the candidate list using the selection policy above.
2. Load composition metadata from ``metadata.json`` (if available).
3. Load per-stage summary tables (if available):

   - scan ranking table
   - filtered ranking table
   - bandgap summary table
   - formation energy table

4. For each selected candidate:

   a. Read relaxed-energy metadata from ``candidate_*/02_relax/meta.json`` (if available).
   b. Combine composition-level and candidate-level information into one row.

5. Write all rows to:

::

   results_database.csv


Database Schema
---------------

The output database contains the following columns:

Composition-level fields
~~~~~~~~~~~~~~~~~~~~~~~~

- ``composition_tag``:
  Folder name of the composition (effective tag used by the generator).

- ``requested_index``:
  Index of the composition in the generator loop (if available).

- ``requested_pct_json``:
  JSON string of requested dopant percentages (if available).

- ``effective_pct_json``:
  JSON string of effective dopant percentages after rounding (if available).

- ``rounded_counts_json``:
  JSON string of rounded integer substitution counts (if available).

- ``host_species``:
  Host species used for substitution (from generation metadata).

- ``n_host``:
  Number of host sites in the supercell (from generation metadata).

- ``supercell_json``:
  JSON string of the supercell dimensions (if available).

Candidate-level identity
~~~~~~~~~~~~~~~~~~~~~~~~

- ``candidate``:
  Candidate folder name (e.g., ``candidate_001``).

- ``candidate_path``:
  Absolute path to the candidate folder (useful for linking back to files).

Relax (filtered) fields
~~~~~~~~~~~~~~~~~~~~~~~

- ``rank_relax_filtered``:
  Rank within the filtered set (if available).

- ``E_relaxed_eV_filtered``:
  Relaxed energy read from the filtered ranking table (if available).

Scan fields
~~~~~~~~~~~

- ``rank_scan``:
  Single-point rank from scanning stage (if available).

- ``E_scan_eV``:
  Single-point energy from scanning stage (if available).

Relax (meta) fields
~~~~~~~~~~~~~~~~~~~

- ``E_relaxed_eV``:
  Relaxed energy extracted from ``candidate_*/02_relax/meta.json`` (if available).

Bandgap fields
~~~~~~~~~~~~~~

- ``bandgap_eV``:
  Predicted bandgap value from the bandgap summary (if available).

Formation-energy fields
~~~~~~~~~~~~~~~~~~~~~~~

- ``E_form_eV_total``:
  Total formation energy in eV (if available).

- ``E_form_norm``:
  Normalized formation energy (as written by Step 06, depends on your normalization mode).

- ``n_dopant_atoms``:
  Total dopant atoms used in the candidate (if available).

- ``dopant_counts``:
  Compact dopant-count string from Step 06 (if available).


Outputs
-------

This stage writes one file in the workflow root:

::

   results_database.csv

Each row corresponds to **one selected candidate** in **one composition folder**.


Reproducibility and Skipping
----------------------------

If:

::

   [database].skip_if_done = true

and ``results_database.csv`` already exists, the stage is skipped.

Set ``skip_if_done = false`` to overwrite and regenerate the database.


Notes and Limitations
---------------------

- The database is intentionally flat and file-based; it is designed for quick use
  in pandas, spreadsheets, or plotting scripts.
- Missing upstream files are handled gracefully: columns are left empty/``None``
  when a stage output is unavailable.
- Only the filtered/selected candidate subset is included by design.
  If you later want a “full database” including all candidates, the selection
  policy can be relaxed as an optional mode.