# H_u / H_s model sweep

This is a compute-minimal sweep of the PID-over-idiom-contexts pipeline. It
crosses four large LMs (8–9B parameters) and a `gpt2` baseline with both scoring
reductions and both context modes, and scores every configuration on the idiom
dataset and its parallel literal-VP (non-idiom) dataset.

**Read first:** [`INTERPRETATION.md`](INTERPRETATION.md) — how to read every
magnitude and its direction (which way is "more synergy", why the original
`H_s` is `+inf`, what the new finite synergy metrics mean). The single-page
[`report.html`](report.html) embeds the same guide plus every figure.

## What was swept

| axis | values |
|---|---|
| model | `google/gemma-2-9b`, `Qwen/Qwen3-8B-Base`, `Qwen/Qwen3-8B`, `meta-llama/Llama-3.1-8B` (bf16), `gpt2` baseline (fp32) |
| reduction | `geometric_mean` (`geo`) · `joint` |
| context-mode | `medial` (`--medial-only`, canonical) · `full` (all contexts) |
| dataset | idiom (`data/dataset.tsv`) and non-idiom (`data/nonidioms_dataset.tsv`), 18 phrases each |

**5 × 2 × 2 = 20 configurations**, each scored over both datasets (40 JSONs).

## Metrics (per phrase, averaged over its contexts; see INTERPRETATION.md)

With `p` = idiom score, `q`, `r` = the component-word scores, and `m = max{q,r}` = better component word:

| key | meaning | direction |
|---|---|---|
| `entropy_idiom` `H(p)` | base entropy | ↓ smaller = more concentrated |
| `ratio_u_idiom` `H_u/H` | unique-information ratio (≥1) | ↑ **more synergy** (headline) |
| `syn_frac_idiom` | frac. contexts with `p>m` (0..1) | ↑ **more synergy** (most intuitive) |
| `h_s_log_idiom` `H_s^log` | `mean max{0, log p − log m}` | ↑ **more synergy** (recommended; finite) |
| `ratio_s_log_idiom` | `H_s^log / H(p)` | ↑ more synergy |
| `h_s_log_signed_idiom` | `mean(log p − log m)` (signed) | ↑ more synergy (can be <0) |
| `h_s_reg_idiom` `H_s^reg` | ε-floored finite/continuous `H_s` (ε=0.01) | ↓ smaller = **more synergy** (inverted direction); finite |
| `ratio_s_reg_idiom` | `H_s^reg / H(p)` (≥1) | ↓ smaller = more synergy |
| `h_s_idiom` `H_s` | original `mean −log max{0,p−m}` | ↓ smaller = more synergy; **+inf** if any context is compositional (`p ≤ m`) |
| `ratio_s_idiom` | original `H_s / H(p)` | ↓ smaller = more synergy; mostly +inf |

Each JSON row also stores the raw per-context vectors `p_ctx`, `q_ctx`, `r_ctx`,
so new metrics can be derived offline **without re-running the models**.
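
As a rough illustration of such an offline derivation, the sketch below recomputes four of the table's metrics from one phrase's raw vectors. It assumes the stored scores are strictly positive probabilities; the function name and the returned key names are hypothetical, and `H(p)` and `H_s^reg` are left out because their exact definitions live in INTERPRETATION.md rather than here.

```python
import math

def finite_synergy_metrics(p_ctx, q_ctx, r_ctx):
    """Recompute a few per-phrase metrics from raw per-context scores.

    Assumption: every score is a strictly positive probability.
    """
    # m = max{q, r}: the better component word in each context.
    m_ctx = [max(q, r) for q, r in zip(q_ctx, r_ctx)]
    n = len(p_ctx)

    # Signed log-gap log p - log m for each context.
    log_gap = [math.log(p) - math.log(m) for p, m in zip(p_ctx, m_ctx)]

    # Original H_s term: -log max{0, p - m}, which is +inf whenever p <= m.
    h_s_terms = [
        -math.log(p - m) if p > m else math.inf
        for p, m in zip(p_ctx, m_ctx)
    ]

    return {
        "syn_frac": sum(p > m for p, m in zip(p_ctx, m_ctx)) / n,
        "h_s_log": sum(max(0.0, g) for g in log_gap) / n,
        "h_s_log_signed": sum(log_gap) / n,
        "h_s": sum(h_s_terms) / n,
    }

# Toy vectors (made up for illustration, not taken from the sweep).
example = finite_synergy_metrics(
    p_ctx=[0.4, 0.1], q_ctx=[0.2, 0.2], r_ctx=[0.1, 0.05]
)
```

In the toy input the second context has `p ≤ m`, so the original `H_s` becomes `+inf` while the log-based variants stay finite, which is the behaviour the table describes.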

## Layout & naming

Every artifact is keyed `{slug}__{mode}__{reduction}`:
slug ∈ `gpt2 | gemma2-9b | qwen3-8b-base | qwen3-8b | llama3.1-8b`,
mode ∈ `medial | full`, reduction ∈ `geo | joint`.
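
For scripting against this layout, the key pattern can be enumerated and parsed with a few lines of Python. This is a minimal sketch; the helper names are hypothetical and not part of the repository, and the JSON path pattern is the `json/{key}__{idiom,nonidiom}.json` form listed below.

```python
from itertools import product

SLUGS = ["gpt2", "gemma2-9b", "qwen3-8b-base", "qwen3-8b", "llama3.1-8b"]
MODES = ["medial", "full"]
REDUCTIONS = ["geo", "joint"]
DATASETS = ["idiom", "nonidiom"]

def config_keys():
    """All {slug}__{mode}__{reduction} keys (5 x 2 x 2 = 20)."""
    return [f"{s}__{m}__{r}" for s, m, r in product(SLUGS, MODES, REDUCTIONS)]

def json_paths():
    """One raw JSON per key and dataset (20 x 2 = 40)."""
    return [f"json/{k}__{d}.json" for k in config_keys() for d in DATASETS]

def parse_key(key):
    """Split a key back into its three fields."""
    slug, mode, reduction = key.split("__")
    return {"slug": slug, "mode": mode, "reduction": reduction}
```

Because slugs contain single hyphens and dots but never a double underscore, splitting on `__` recovers the three fields unambiguously.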

- `json/{key}__{idiom,nonidiom}.json` — raw output (header + per-phrase rows + p/q/r vectors).
- `reports/{key}.md` — comprehensive per-metric idiom-vs-nonidiom report (means, 95% bootstrap CIs, cross-dataset gap) for **every** metric + full per-phrase tables.
- `reports/{key}__analyze.txt` — verbatim `analyze_hcov.py` output.
- `figures/{key}/` — up to 27 PNGs: `{01_strip,02_bars,03_per_idiom}_{ratio_u,h_u,syn_frac,h_s_log,ratio_s_log,h_s_reg,ratio_s_reg,h_s,ratio_s}.png` (the original `h_s`/`ratio_s` are absent when +inf everywhere).
- `figures_aggregate/` — cross-config figures:
  - `agg_01_forest_ratio_u_gap` — Δ(H_u/H) ± 95% CI, every config (headline)
  - `agg_08_peridiom_ratio_u` — per-phrase H_u/H across the 4 study models, **idioms vs non-idioms** with a separator
  - `agg_07_model_means_ratio_u` — per-model mean H_u/H gap (incl. gpt2)
  - `agg_02_grouped_ratio_u`, `agg_04_grouped_entropy` — idiom-vs-nonidiom bars
  - `agg_09_forest_ratio_s_log_gap`, `agg_10_grouped_syn_frac`, `agg_11_grouped_ratio_s_log` — the new finite synergy metrics
- `SUMMARY.md` — master cross-config comparison, one table per metric.
- `report.html` — single-page navigable report (TOC, interpretation guide, all figures, click-to-zoom).
- `logs/` — per-run logs + driver logs.
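
The per-config reports quote 95% bootstrap CIs for the idiom-vs-nonidiom gap. The exact resampling scheme is not described here, so the following is only a sketch of one common choice: a percentile bootstrap over phrases, resampling each dataset independently. The function name, the number of resamples, and the seed are assumptions, and the inputs must be finite (the original `H_s` would need filtering first).

```python
import random
from statistics import mean

def bootstrap_gap_ci(idiom_vals, nonidiom_vals, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(idiom) - mean(nonidiom).

    Assumption: per-phrase values are finite and phrases are resampled
    with replacement, independently within each dataset.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        a = rng.choices(idiom_vals, k=len(idiom_vals))
        b = rng.choices(nonidiom_vals, k=len(nonidiom_vals))
        gaps.append(mean(a) - mean(b))
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    point = mean(idiom_vals) - mean(nonidiom_vals)
    return point, lo, hi

# Toy per-phrase values (made up for illustration, not sweep results).
gap, ci_lo, ci_hi = bootstrap_gap_ci(
    [1.2, 1.3, 1.1, 1.4], [1.0, 1.05, 0.95, 1.1]
)
```

With 18 phrases per dataset, a percentile interval like this is cheap to recompute from the JSONs whenever a new metric is added.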

## Regenerating (compute-smart)

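The model forward pass is the only expensive step, and both reductions (`geo`,
`joint`) and both context modes (`medial`, `full`) can be derived from a single
pass per sentence. Regeneration is therefore split into a GPU stage and a CPU stage: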

```bash
bash run_sweep_cached.sh   # GPU: each model loaded once, each sentence scored once -> json/
bash regen_reports.sh      # CPU: analyze + reports + plots + summary + aggregates + report.html
```

`run_sweep_cached.sh` (via `code/sweep_cached.py`) reproduces `code/main.py`'s
numbers bit-for-bit but scores each distinct sentence exactly once, which takes
roughly 3–4× less GPU time than calling `main.py` once per config.
`regen_reports.sh` is pure post-processing from the JSONs, so it can be rerun
freely after any plotting or report change.
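
The caching idea can be sketched as follows. This is not `code/sweep_cached.py`: the class, its method names, and the stand-in scorer are hypothetical. It also assumes that `joint` is the product of token probabilities and `geo` their geometric mean, which the reduction names suggest but this README does not define.

```python
import math

class CachedScorer:
    """Memoize per-token log-probs so each distinct sentence is scored once."""

    def __init__(self, token_logprob_fn):
        # token_logprob_fn stands in for the expensive model forward pass.
        self._fn = token_logprob_fn
        self._cache = {}
        self.forward_calls = 0

    def token_logprobs(self, sentence):
        if sentence not in self._cache:
            self.forward_calls += 1
            self._cache[sentence] = list(self._fn(sentence))
        return self._cache[sentence]

    def score(self, sentence, reduction):
        lps = self.token_logprobs(sentence)
        if reduction == "joint":
            # Assumed: product of token probabilities.
            return math.exp(sum(lps))
        if reduction == "geo":
            # Assumed: geometric mean of token probabilities.
            return math.exp(sum(lps) / len(lps))
        raise ValueError(f"unknown reduction: {reduction}")

# Stand-in for a model: a constant log-prob of -0.5 per whitespace token.
scorer = CachedScorer(lambda s: [-0.5] * len(s.split()))
joint_score = scorer.score("kick the bucket", "joint")
geo_score = scorer.score("kick the bucket", "geo")
```

Both reductions are served from the same cached pass, so `forward_calls` stays at 1 for the sentence. The `medial`/`full` split would then be a filter over which cached contexts are averaged.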

## Caveats

- `gpt2` is a small-model **baseline**; it appears in the data, summary tables,
  per-config galleries, and the per-model gap plot (`agg_07`). The 4 big models
  are the study set; `agg_08` is restricted to them.
- `full` mode includes final-position sentences (empty right context); the
  canonical analysis and the README math assume `medial`.
- The study models are **base** (non-instruct) LMs, with one exception:
  `Qwen/Qwen3-8B` is the post-trained counterpart of `Qwen/Qwen3-8B-Base`.
- The JSON files contain the non-standard `Infinity` literal that Python's `json`
  module emits, so strict JSON parsers will reject them; the analysis scripts handle it.
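
For consumers outside the provided scripts, the `Infinity` caveat can be handled as in the sketch below. Python's own `json.loads` accepts `Infinity` by default; the `to_strict_json` helper is hypothetical and shows one way to re-export rows for strict parsers. Mapping infinite values to `null` is an assumption made for illustration and loses the distinction between "infinite" and "missing".

```python
import json
import math

def to_strict_json(obj):
    """Recursively replace non-finite floats with None (JSON null)."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    if isinstance(obj, dict):
        return {k: to_strict_json(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_strict_json(v) for v in obj]
    return obj

# A made-up row fragment using two keys from the metrics table.
raw = '{"h_s_idiom": Infinity, "syn_frac_idiom": 0.5}'

row = json.loads(raw)  # Python's parser accepts Infinity by default
strict = json.dumps(to_strict_json(row), allow_nan=False)

try:
    json.dumps(row, allow_nan=False)  # strict mode refuses infinite floats
    strict_dump_failed = False
except ValueError:
    strict_dump_failed = True
```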
