300 / 300
PASSED

105 test scenarios across 8 categories. Per-person tests (invariants, dual-view, edge cases) run for each of the 6 team members → 300 total checks. All passing.

Alice (Arch, 60h) ✓ · Bob (Dev, 80h) ✓ · Carol (Dev, 60h) ✓ · Dave (DevOps, 40h) ✓ · Eve (PM, 40h) ✓ · Frank (Dev, 40h) ✓

1. Mathematical Invariants — 9 scenarios × 6 persons = 54 checks

54/54 PASS

Fundamental mathematical guarantees of the EDPA engine — validated on every build. These tests verify that the model never breaks its key promises.

Each scenario runs per person: Alice, Bob, Carol, Dave, Eve, Frank — 54 checks total.
01 test_sum_equals_capacity PASS

Derived hours must match the declared capacity of the person.

Expected: Σ(hours) = capacity ± 0.01h
02 test_ratio_sum_equals_one PASS

Ratios must sum to 1.0 for every person with items.

Expected: Σ(ratio) = 1.0 ± 0.001
03 test_no_negative_hours PASS

No person may have negative derived hours.

Expected: All hours ≥ 0
04 test_no_negative_scores PASS

No score may be negative.

Expected: All scores ≥ 0
05 test_score_formula PASS

Score is calculated as JS multiplied by cw (per-item normalized share).

Expected: Score = JS × cw
06 test_per_item_cw_sums_to_one PASS

Per-item invariant: Σ cw across all persons on an item equals 1.0.

Expected: Σ_persons cw[*, item] = 1.0
07 test_capacity_invariant PASS

Per-person invariant: Σ derived hours equals declared capacity.

Expected: Σ_items DerivedHours[P, *] = Capacity[P, I] ± 0.01h
08 test_all_invariants_flag PASS

The invariant_ok flag must reflect the actual check results.

Expected: invariant_ok reflects actual checks
09 test_empty_items_no_crash PASS

A person with 0 items must get 0h, without crashing.

Expected: Person with 0 items → 0h, no crash
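
As a rough illustration of checks 01 to 03 and 09, the sketch below validates one person's allocation. The function name `check_person_invariants` and the input shape (a list of `(ratio, hours)` pairs) are assumptions made for this example, not the engine's actual API; the tolerances are the ones stated above.

```python
from math import fsum

HOURS_TOL = 0.01   # tolerance on the sum of hours, as stated in check 01
RATIO_TOL = 0.001  # tolerance on the sum of ratios, as stated in check 02


def check_person_invariants(capacity, allocations):
    """Return the invariant results for one person.

    `allocations` is a list of (ratio, hours) pairs, one per item
    (hypothetical shape, chosen for illustration).
    """
    if not allocations:
        # Check 09: a person with 0 items gets 0h and nothing crashes.
        return {"total_hours": 0.0, "invariant_ok": True}

    total_ratio = fsum(r for r, _ in allocations)
    total_hours = fsum(h for _, h in allocations)
    checks = [
        abs(total_hours - capacity) <= HOURS_TOL,  # check 01
        abs(total_ratio - 1.0) <= RATIO_TOL,       # check 02
        all(h >= 0 for _, h in allocations),       # check 03
    ]
    # Check 08: the flag is derived from the checks, never set by hand.
    return {"total_hours": total_hours, "invariant_ok": all(checks)}
```

Deriving `invariant_ok` from the individual checks keeps the flag from drifting away from what was actually verified.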

2. Evidence Detection — 10 scenarios

10/10 PASS

Verification of GitHub signal detection and additive aggregation into contribution_score. Signals add up; the resulting CW is the per-item normalized share (Σ across persons = 1.0 per item).

01 test_commit_author_signal PASS

Commit with S-XXX/F-XXX/E-XXX/I-XXX in branch/message adds signals.commit_author (default 4.00).

Expected: commit_author → +4.00 to score
02 test_pr_reviewer_signal PASS

Submitted PR review (excluding self) adds signals.pr_reviewer (default 2.17).

Expected: pr_reviewer → +2.17 to score
03 test_issue_comment_signal PASS

Issue/PR comment (bots excluded) adds signals.issue_comment (default 1.46).

Expected: issue_comment → +1.46 to score
04 test_signals_aggregate_additively PASS

When a person has commit + review + comment, weights sum (4.00 + 2.17 + 1.46 = 7.63); there is no highest-wins.

Expected: Σ fired signal_weights, not max
05 test_contribute_directive_additive PASS

/contribute @person weight:0.6 adds a manual:* signal with weight 0.6 — it does not override auto-detection.

Expected: /contribute @user weight:X → +X to score
06 test_per_item_normalization PASS

After aggregation, CW is normalized per item: cw = score / Σ persons. Invariant Σ cw[*, item] = 1.0.

Expected: Σ cw per item = 1.0 (±0.001)
07 test_branch_naming_extraction_S_F_E_I PASS

Branch regex extracts S-/F-/E-/I-XXX for Story, Feature, Epic, Initiative.

Expected: Detect: S-200, F-15, E-3, I-1
08 test_no_signals_no_contributor PASS

A person with no fired signal does not appear in contributors[] after aggregation.

Expected: score = 0 → excluded
09 test_bot_comments_filtered PASS

Comments from GitHub apps / bots do not count as issue_comment signals.

Expected: Bot login → signal does not fire
10 test_commit_count_no_time_effect PASS

Many commits by one author on an item count as a single commit_author signal (relevance, not volume).

Expected: 1 commit = 10 commits in terms of score
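
The aggregation rules in this section can be sketched as follows. This is a minimal illustration, not the engine's code: the function names and the `(person, signal, is_bot)` event shape are hypothetical, and treating every signal type as fire-once is an assumption (the scenarios state it explicitly only for commits).

```python
# Default weights as listed in the scenarios above.
SIGNAL_WEIGHTS = {"commit_author": 4.00, "pr_reviewer": 2.17, "issue_comment": 1.46}


def contribution_scores(events, weights=SIGNAL_WEIGHTS):
    """Sum the weight of each distinct signal type fired per person.

    `events` is a list of (person, signal, is_bot) tuples for ONE item
    (hypothetical shape). Repeated signals of the same type count once,
    so 10 commits score the same as 1 commit.
    """
    fired = {}
    for person, signal, is_bot in events:
        if is_bot or signal not in weights:
            continue  # bot activity and unknown signals never fire
        fired.setdefault(person, set()).add(signal)
    return {p: sum(weights[s] for s in sigs) for p, sigs in fired.items()}


def normalize_per_item(scores):
    """cw = score / sum of scores over persons; score 0 is excluded."""
    total = sum(scores.values())
    if total == 0:
        return {}
    return {p: s / total for p, s in scores.items() if s > 0}


events = [
    ("alice", "commit_author", False),
    ("alice", "commit_author", False),  # second commit: no extra score
    ("alice", "pr_reviewer", False),
    ("alice", "issue_comment", False),
    ("bob", "issue_comment", False),
    ("ci-bot", "issue_comment", True),  # filtered out as a bot
]
scores = contribution_scores(events)
cw = normalize_per_item(scores)
```

Here `alice` scores the sum of all three weights and `bob` a single comment weight; both still receive a proportional share because no threshold is applied.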

3. CW Heuristics — 15 scenarios

15/15 PASS

Verification of heuristic signal weight correctness and rules for determining Contribution Weight. The heuristic must be consistent and reproducible.

01 test_signal_weight_ordering PASS

Signal weights must be ordered: commit_author ≥ pr_reviewer ≥ issue_comment.

Expected: sw.commit_author >= sw.pr_reviewer >= sw.issue_comment
02 test_per_item_cw_normalization PASS

Per-item cw share: Σ cw across persons = 1.0 per item.

Expected: Σ_persons cw[*, item] = 1.0 (engine invariant)
03 test_per_person_capacity_invariant PASS

Per-person hours invariant: Σ hours = capacity_per_iteration.

Expected: Σ_items DerivedHours[P, *] = Capacity[P, I]
04 test_no_role_overrides_in_heuristics PASS

Calibration runs on the 3 signal weights only; cw_heuristics.yaml must not contain a role_overrides block.

Expected: cw_heuristics.yaml.role_overrides not present
05 test_signal_weights_commit PASS

Commit author signal must have signal weight 4.00 in cw_heuristics.yaml.

Expected: signals.commit_author = 4.00
06 test_signal_weights_pr_reviewer PASS

PR reviewer signal must have signal weight 2.17 in cw_heuristics.yaml.

Expected: signals.pr_reviewer = 2.17
07 test_signal_weights_issue_comment PASS

Issue comment signal must have signal weight 1.46 in cw_heuristics.yaml.

Expected: signals.issue_comment = 1.46
08 test_signals_aggregate_additively PASS

CW is computed by additive signal aggregation + per-item normalization. No highest-signal-wins.

Expected: cw[P, item] = Σ signal_weights / Σ_persons Σ signal_weights
09 test_contribute_directive_additive PASS

Manual /contribute @person weight:X adds a manual:* signal with weight X — it does not override auto-detection.

Expected: cw_after = (cw_auto_score + X) / Σ_persons score
10 test_per_item_invariant PASS

Σ cw across persons on a single item = 1.0 (engine invariant, ±0.001).

Expected: Σ_persons cw[*, item] = 1.0
11 test_cw_range_0_1 PASS

CW is a per-item normalized share, always in [0, 1.0]. No fixed per-signal floor/ceiling.

Expected: 0 ≤ cw[P, item] ≤ 1.0
12 test_no_negative_contribution_score PASS

contribution_score sums positive signal weights — no signal subtracts.

Expected: contribution_score ≥ 0 always
13 test_per_item_independence PASS

A person's CW on item A is independent of CW on item B (per-item normalization).

Expected: cw[P, A] independent of cw[P, B]
14 test_per_person_independence PASS

Two persons' contribution_score values on the same item are independent (each comes from that person's own signal aggregation); only the per-item normalization couples their cw.

Expected: score[P1, item] independent of score[P2, item]
15 test_strategic_role_via_signal_calibration PASS

PM/BO/Arch contribution is captured via issue_comment + pr_reviewer + manual /contribute. Per-role multipliers are not used — bias is handled by calibrating signal weights against ground truth.

Expected: No role_overrides in heuristics
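
Scenario 09 can be illustrated with a small sketch of how a manual directive might be folded into the scores. The regular expression and the helper name `apply_contribute` are assumptions for this example; only the directive form `/contribute @person weight:X` and the additive rule come from the text above.

```python
import re

# Pattern for the directive form quoted above: /contribute @person weight:X
DIRECTIVE_RE = re.compile(r"/contribute\s+@([\w-]+)\s+weight:(\d+(?:\.\d+)?)")


def apply_contribute(scores, comment_body):
    """Return a copy of `scores` with manual weights added on top."""
    updated = dict(scores)
    for person, weight in DIRECTIVE_RE.findall(comment_body):
        # Additive: the manual weight is added, the auto-detected score is kept.
        updated[person] = updated.get(person, 0.0) + float(weight)
    return updated


auto = {"alice": 4.00, "eve": 1.46}  # auto-detected scores on one item
after = apply_contribute(auto, "/contribute @eve weight:0.6")
total = sum(after.values())
cw = {p: s / total for p, s in after.items()}
```

Because the directive only adds to the score, the other contributors' shares shrink through normalization rather than being overwritten.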

4. Dual-View Consistency — 12 scenarios × 6 persons = 72 checks

72/72 PASS

EDPA provides two views — per-person and per-item. Both must be mutually consistent and sums must match in both directions.

Each scenario runs per person: Alice, Bob, Carol, Dave, Eve, Frank — 72 checks total.
01 test_per_person_sum_equals_capacity PASS

Per-person view: sum of derived hours = capacity for every person.

Expected: Σ DerivedHours[P, *] = Capacity[P]
02 test_per_item_shares_sum_100 PASS

Per-item view: sum of shares of all contributors = 100% for every item.

Expected: Σ shares[*, item] = 100%
03 test_same_cw_same_results_both_views PASS

Same CW must produce same results in both views.

Expected: per-person hours consistent with per-item shares
04 test_no_transitions_degenerates_to_done_credit PASS

When git history records no Feature/Epic/Initiative transitions, the engine credits only Story Done items.

Expected: gate_events empty → only Story Done credit fires
05 test_per_person_hours_sum_cross_items PASS

Per-person: hours on item X + hours on all other items = total capacity.

Expected: hours[P, X] + hours[P, rest] = capacity[P]
06 test_zero_contribution_excluded_both_views PASS

Items with zero contribution do not appear in either view.

Expected: zero contribution → absent in both views
07 test_single_contributor_full_share PASS

Single contributor on an item gets 100% share in per-item view.

Expected: single contributor → 100% share
08 test_two_equal_contributors_equal_split PASS

Two contributors with equal CW get 50/50 split in per-item view.

Expected: equal CW → 50/50 share split
09 test_capacity_no_affect_per_item_share PASS

Different capacities do not affect percentage share in per-item view.

Expected: capacity[P1] ≠ capacity[P2] → share% unchanged
10 test_cross_check_hours_vs_capacities PASS

Cross-check: sum of all per-item hours across all items ≤ sum of all capacities.

Expected: ΣΣ hours[P, item] ≤ Σ capacity[P]
11 test_three_contributors_weighted_split PASS

Three contributors with CW 1.0, 0.6, 0.25 — shares match weight ratio.

Expected: 1.0:0.6:0.25 → 54%:32%:14% share
12 test_per_item_hours_sum_matches_js_proportion PASS

Sum of hours on an item from all persons matches the Job Size proportion in total budget.

Expected: item hours reflect JS weight in total budget
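
The arithmetic behind scenario 11 can be checked with a few lines. The helper `per_item_shares` is a hypothetical name; it takes no capacity argument, which mirrors scenario 09 (capacity does not affect the per-item share).

```python
def per_item_shares(weights):
    """Per-item view: each contributor's weight divided by the item total."""
    total = sum(weights.values())
    return {person: w / total for person, w in weights.items()}


shares = per_item_shares({"p1": 1.0, "p2": 0.6, "p3": 0.25})
percent = {person: round(100 * s) for person, s in shares.items()}
# The weights sum to 1.85, so the shares are 1.0/1.85, 0.6/1.85 and 0.25/1.85.
```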

5. Edge Cases — 18 scenarios × 6 persons = 108 checks

108/108 PASS

Boundary cases and extreme scenarios that the EDPA engine must handle without crashing, with correct results and no precision loss.

Each scenario runs per person: Alice, Bob, Carol, Dave, Eve, Frank — 108 checks total.
01 test_person_zero_relevant_items PASS

Person with 0 relevant items must get 0 hours without crashing.

Expected: 0 items → 0h, no crash
02 test_person_single_item_full_capacity PASS

Person with one item must get full capacity.

Expected: 1 item → hours = capacity
03 test_all_items_same_job_size PASS

All items with the same Job Size — hours distributed only by CW.

Expected: same JS → distribution by CW only
04 test_all_people_same_cw_on_item PASS

All people with the same CW on an item — hours proportional to capacity.

Expected: same CW → hours proportional to capacity
05 test_job_size_zero_excluded PASS

Item with Job Size = 0 must be excluded from calculation (no division by zero).

Expected: JS = 0 → item excluded, no division by zero
06 test_single_person_team PASS

Single-person team: person gets full capacity regardless of CW.

Expected: single person → full capacity
07 test_hundred_items_capacity_sum PASS

100 items for one person — capacity must still sum correctly.

Expected: 100 items: Σ hours = capacity
08 test_max_job_size_allocation PASS

Maximum Job Size (20) must produce correct proportional allocation.

Expected: JS = 20 → correct proportional allocation
09 test_min_job_size_allocation PASS

Minimum Job Size (1) must produce correct proportional allocation.

Expected: JS = 1 → correct proportional allocation
10 test_all_cw_equal_distribution PASS

All CW = 1.0: hours are distributed in proportion to Job Size.

Expected: all CW = 1.0 → distribution proportional to JS
11 test_very_unequal_capacities PASS

Very unequal capacities (10h vs 160h) — each person sums to their own capacity.

Expected: 10h + 160h: each sums to own capacity
12 test_floating_point_precision PASS

Float precision: sum must be within 0.01h tolerance of capacity.

Expected: Σ within 0.01h tolerance
13 test_unicode_item_titles PASS

Unicode characters in item titles must not cause processing errors.

Expected: Unicode titles → no processing errors
14 test_empty_iteration_graceful PASS

Empty iteration (no stories) must be handled without crashing.

Expected: empty iteration → graceful handling
15 test_person_only_epic_feature PASS

Person only on Epic/Feature (no stories) must still get allocation.

Expected: Epic/Feature only → still gets allocation
16 test_negative_job_size_rejected PASS

Negative Job Size must be rejected — no negative allocation.

Expected: JS < 0 → item rejected
17 test_duplicate_person_on_item_no_double_count PASS

Duplicate signals from the same person on an item must not double the allocation.

Expected: duplicate signals → single CW entry
18 test_large_team_scaling PASS

20+ people in a team — calculation still converges and invariants hold.

Expected: 20+ people: all invariants hold
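
A few of these edge cases can be sketched as input guards placed in front of the allocation. The function names are hypothetical, and returning excluded and rejected items this way is only one possible reading of scenarios 01, 05 and 16.

```python
def usable_items(items):
    """Split items into those that enter the calculation and those rejected.

    `items` maps item id -> Job Size. JS = 0 is excluded (it would otherwise
    contribute nothing and risk a division by zero); JS < 0 is rejected.
    """
    kept, rejected = {}, []
    for item_id, js in items.items():
        if js < 0:
            rejected.append(item_id)
        elif js > 0:
            kept[item_id] = js
    return kept, rejected


def hours_for_person(capacity, scores):
    """Distribute `capacity` over items in proportion to `scores`."""
    total = sum(scores.values())
    if total == 0:
        return {}  # 0 relevant items -> 0h, no crash
    return {item: capacity * s / total for item, s in scores.items()}
```

With a single item the person receives the full capacity, and with many items the sum stays within floating-point error of the capacity.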

6. Auto-calibration — 12 scenarios

12/12 PASS

Verification of the auto-calibration system inspired by Karpathy's autoresearch pattern. Calibration must be safe, reproducible, and efficient.

01 test_minimum_ground_truth_records PASS

Calibration requires a minimum of 20 manually confirmed CW records.

Expected: len(ground_truth) ≥ 20 required
02 test_mad_computation_correctness PASS

MAD (Mean Absolute Deviation) is correctly computed as mean of |auto_cw - confirmed_cw|.

Expected: MAD = mean(|auto_cw - confirmed_cw|)
03 test_lower_mad_better PASS

Lower MAD = better heuristic. Optimization direction must be "lower is better".

Expected: direction: lower is better
04 test_calibrator_locked PASS

Calibrator (calibrate_signals.py) is locked: the synthetic corpus generator and the MAD cost function live in one file, and the agent must not modify it, since editing the cost function would allow gaming the metric.

Expected: calibrate_signals.py: LOCKED, read-only
05 test_single_change_per_iteration PASS

Each experiment changes only one parameter — isolation of change effects.

Expected: one parameter change per experiment
06 test_git_commit_after_experiment PASS

A git commit is made after each experiment — memory = git log.

Expected: git commit after each experiment
07 test_revert_on_worse_mad PASS

When MAD worsens, the experiment is reverted (git reset --hard HEAD~1).

Expected: MAD worse → experiment commit discarded (git reset --hard HEAD~1)
08 test_keep_on_better_or_equal_mad PASS

When MAD improves or stays the same, the experiment is kept.

Expected: MAD better/equal → keep commit
09 test_budget_50_experiments_max PASS

Maximum budget is 50 experiments — protection against infinite loops.

Expected: budget ≤ 50 experiments
10 test_expected_improvement_range PASS

Expected improvement is 15–30% MAD reduction after 50 experiments.

Expected: expected: 15-30% MAD reduction
11 test_ground_truth_format_validation PASS

Ground truth records must contain: item_id, person_id, evidence_role, auto_cw, confirmed_cw.

Expected: required fields: item_id, person_id, evidence_role, auto_cw, confirmed_cw
12 test_no_data_leakage PASS

No data leakage between training and validation sets — strict separation.

Expected: no data leakage between train/validation
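
The metric and the keep/revert rule from scenarios 01, 02, 07 and 08 fit in a short sketch. The function names are hypothetical; the record fields `auto_cw` and `confirmed_cw` are the ones listed in scenario 11.

```python
MIN_RECORDS = 20  # minimum number of confirmed records, per scenario 01


def mean_absolute_deviation(records):
    """MAD = mean(|auto_cw - confirmed_cw|) over ground-truth records."""
    if len(records) < MIN_RECORDS:
        raise ValueError("calibration needs at least 20 confirmed CW records")
    return sum(abs(r["auto_cw"] - r["confirmed_cw"]) for r in records) / len(records)


def decide(previous_mad, new_mad):
    """Keep an experiment when MAD is better or equal, otherwise revert it."""
    return "keep" if new_mad <= previous_mad else "revert"
```

In the loop described above, a "revert" decision would correspond to discarding the experiment's commit and a "keep" decision to leaving it in the git log.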

7. Governance & Audit — 17 scenarios

17/17 PASS

Verification of audit trail, freeze rules, governance processes, and compliance requirements. EDPA must be fully auditable and reproducible.

01 test_snapshot_frozen_after_close PASS

Snapshot is frozen after Iteration Close — must not be modified.

Expected: snapshot.frozen = true after close
02 test_frozen_snapshot_immutable PASS

Frozen snapshot must not be modified in-place.

Expected: frozen snapshot: no in-place modification
03 test_corrections_create_new_revision PASS

Corrections create a new revision (_rev2, _rev3), never overwrite the original.

Expected: correction → new revision (_rev2, _rev3)
04 test_snapshot_required_fields PASS

Snapshot must contain all 10 required top-level keys.

Expected: 10 required keys present in snapshot
05 test_branch_naming_enforced PASS

Branch naming convention: {type}/{ITEM-ID}-description must be enforced.

Expected: branch: {type}/{ITEM-ID}-description
06 test_pr_references_work_item PASS

PR must reference a work item (S-XXX, F-XXX, E-XXX) in title or body.

Expected: PR references: S-XXX, F-XXX, or E-XXX
07 test_traceability_chain PASS

Full traceability chain: Initiative → Epic → Feature → Story → PR → Commit.

Expected: Initiative → Epic → Feature → Story → PR → Commit
08 test_wsjf_calculation PASS

WSJF is correctly calculated as (BV + TC + RR&OE) / JS.

Expected: WSJF = (BV + TC + RR&OE) / JS
09 test_job_size_guardrails_story PASS

Job Size guardrails for Story: JS ≤ 8 (ideally ≤ 5).

Expected: Story JS ≤ 8 (recommended ≤ 5)
10 test_job_size_guardrails_feature PASS

Job Size guardrails for Feature: JS ≤ 13.

Expected: Feature JS ≤ 13
11 test_job_size_guardrails_epic PASS

Job Size guardrails for Epic: JS ≤ 20.

Expected: Epic JS ≤ 20
12 test_dor_checklist_validation PASS

Definition of Ready checklist: description, AC, estimate, parent linked.

Expected: DoR: description, AC, estimate, parent linked
13 test_dod_checklist_validation PASS

Definition of Done checklist: code reviewed, tests passed, PR merged.

Expected: DoD: code reviewed, tests passed, PR merged
14 test_wip_limit_enforcement PASS

WIP limit: ideally 1 Story per person at any given time.

Expected: WIP limit: 1 Story per person (ideal)
15 test_bankid_signature_support PASS

BankID electronic signature support (Act 21/2020 Coll.).

Expected: BankID signature: Act 21/2020 Coll.
16 test_reproducible_calculation PASS

Reproducible calculation: same inputs must always produce same outputs.

Expected: same inputs → same outputs (deterministic)
17 test_audit_trail_five_pillars PASS

Audit trail covers 5 pillars: GitHub evidence, capacity, snapshot, reproducible calculation, signature.

Expected: 5 pillars: evidence, capacity, snapshot, calc, signature
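
The WSJF formula and the Job Size guardrails (scenarios 08 to 11) translate directly into code. The sketch below uses hypothetical helper names, and rejecting a non-positive Job Size in `wsjf` is an assumption added to avoid a division by zero.

```python
JS_LIMITS = {"Story": 8, "Feature": 13, "Epic": 20}  # guardrails listed above


def wsjf(bv, tc, rr_oe, job_size):
    """WSJF = (BV + TC + RR&OE) / JS."""
    if job_size <= 0:
        raise ValueError("Job Size must be positive")
    return (bv + tc + rr_oe) / job_size


def within_guardrail(item_type, job_size):
    """True when the Job Size respects the limit for its item type."""
    return job_size <= JS_LIMITS[item_type]
```

For example, `wsjf(8, 5, 3, 4)` divides a cost of delay of 16 by a Job Size of 4.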

8. Capacity Planning — 12 scenarios

12/12 PASS

Verification of the Iteration Planning Protocol — planning_factor as a team-level property, 80% rule, buffer usage tracking, and capacity commitment workflow.

01 test_planning_factor_team_level PASS

planning_factor must be a team-level property, not a cadence or person-level property.

Expected: teams[].planning_factor (not cadence, not person)
02 test_planning_factor_default PASS

Default planning_factor must be 0.8 (plan to 80% of total capacity).

Expected: planning_factor default = 0.8
03 test_planning_factor_range PASS

planning_factor must be in range (0, 1.0] — never zero, never above 100%.

Expected: 0 < planning_factor ≤ 1.0
04 test_planning_capacity_formula PASS

Planning Capacity = Total Capacity × planning_factor for each team.

Expected: Planning_Capacity = Σ Capacity[P] × planning_factor
05 test_different_teams_different_factors PASS

Different teams may have different planning_factor values.

Expected: teams[A].planning_factor ≠ teams[B].planning_factor allowed
06 test_edpa_uses_total_not_planning PASS

EDPA calculation always uses Total Capacity (100%), not Planning Capacity.

Expected: DerivedHours uses Capacity[P], not Planning_Capacity
07 test_buffer_absorbs_unplanned PASS

Buffer (20% by default) absorbs support, maintenance, incidents, and unplanned work.

Expected: buffer = Total - Planning → unplanned work
08 test_unplanned_items_generate_evidence PASS

Unplanned items in the buffer generate evidence and are allocated normally by EDPA.

Expected: unplanned items → evidence → normal EDPA allocation
09 test_capacity_confirmed_at_planning PASS

Each team member must confirm availability at Iteration Planning (availability: confirmed).

Expected: availability = confirmed required
10 test_planning_factor_no_affect_invariant PASS

planning_factor must not affect the mathematical guarantee Σ DerivedHours = Capacity.

Expected: planning_factor → no effect on Σ = Capacity
11 test_buffer_usage_metric PASS

Buffer_Usage metric tracks how much of the reserve was consumed by unplanned work.

Expected: Buffer_Usage = unplanned / (Total - Planning) × 100%
12 test_high_buffer_usage_warning PASS

Consistently high buffer usage (>90%) should trigger a warning to adjust capacity or scope.

Expected: Buffer_Usage > 90% → warning
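
The planning formulas above can be sketched as two small functions. The names are hypothetical, and returning 0.0 when there is no buffer (planning_factor = 1.0) is an assumption, since the scenarios do not define Buffer_Usage for that case.

```python
def planning_capacity(capacities, planning_factor=0.8):
    """Planning_Capacity = sum of Capacity[P] × planning_factor."""
    if not 0 < planning_factor <= 1.0:
        raise ValueError("planning_factor must be in (0, 1.0]")
    return sum(capacities) * planning_factor


def buffer_usage(capacities, unplanned_hours, planning_factor=0.8):
    """Buffer_Usage = unplanned / (Total - Planning) × 100%."""
    total = sum(capacities)
    buffer = total - planning_capacity(capacities, planning_factor)
    if buffer == 0:
        return 0.0  # assumption: no buffer exists, so none can be consumed
    return unplanned_hours / buffer * 100


team = [60, 80, 60, 40, 40, 40]   # capacities of the six team members above
planned = planning_capacity(team)  # 80% of 320h
usage = buffer_usage(team, 60)     # 60h of unplanned work against a 64h buffer
needs_warning = usage > 90
```

Neither function touches DerivedHours, which matches scenarios 06 and 10: the planning factor shapes the commitment, not the allocation.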

Auto-calibration (Karpathy loop)

Automatic calibration system inspired by Karpathy's autoresearch pattern. One file, one metric, one loop.

Configuration
Target: cw_heuristics.yaml.tmpl (signals: block)
Metric: MAD (Mean Absolute Deviation) on the MC corpus
Direction: lower is better
Budget: 2000 MC samples + coordinate descent on top-5 (~10s)
Search space: 3D signal weights, each in [0.1, 8.0]
Calibrator: calibrate_signals.py (LOCKED)
Expected results
  • Typical improvement: 15–30% MAD reduction
  • After 50 experiments: heuristic matches real team patterns
  • Diminishing returns after ~30 experiments
  • Prerequisite: ≥ 20 manually confirmed CW records

Loop

  1. Run python calibrate_signals.py --scenarios 1000 --seed 42
  2. The script will:
    1. Generate a synthetic Monte Carlo corpus (1,000 scenarios, ~31k records in total) procedurally from a model where signal counts probabilistically reflect each person's true cw share
    2. Compute baseline MAD against shipped defaults (commit_author=4.00, pr_reviewer=2.17, issue_comment=1.46)
    3. Phase 1 — random sampling: 2000 random weight vectors in the 3D space [0.1, 8.0], sorted by MAD
    4. Phase 2 — coordinate descent: refines the top-5 candidates, tries ±step per signal, halves step on no-improvement
    5. Return best calibrated weights + MAD improvement %
    6. With --apply, rewrite cw_heuristics.yaml.tmpl + refresh the calibration: metadata
  3. Print summary: baseline MAD, calibrated MAD, % improvement, top weights
  4. (Optional) --report report.json dumps the full run for audit
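
The two search phases in step 2 can be sketched on a toy problem. This is not the contents of calibrate_signals.py: the function `calibrate`, its step sizes, and the stand-in cost `toy_cost` (distance to a known weight vector instead of the real MAD over the corpus) are all assumptions made for illustration.

```python
import random

LOW, HIGH = 0.1, 8.0  # search bounds per signal weight, as stated above


def calibrate(cost, samples=2000, top_k=5, step=0.5, min_step=1e-3, seed=42):
    """Random sampling followed by coordinate descent on the best candidates."""
    rng = random.Random(seed)
    # Phase 1: random weight vectors in the 3D box, sorted by cost.
    candidates = [[rng.uniform(LOW, HIGH) for _ in range(3)] for _ in range(samples)]
    candidates.sort(key=cost)

    best, best_cost = None, float("inf")
    for start in candidates[:top_k]:
        w, w_cost, s = list(start), cost(start), step
        # Phase 2: try +step and -step on each coordinate; halve the step
        # whenever a full pass brings no improvement.
        while s >= min_step:
            improved = False
            for i in range(3):
                for delta in (s, -s):
                    trial = list(w)
                    trial[i] = min(HIGH, max(LOW, trial[i] + delta))
                    trial_cost = cost(trial)
                    if trial_cost < w_cost:
                        w, w_cost, improved = trial, trial_cost, True
            if not improved:
                s /= 2
        if w_cost < best_cost:
            best, best_cost = w, w_cost
    return best, best_cost


TARGET = [4.00, 2.17, 1.46]


def toy_cost(w):
    """Toy stand-in for the MAD cost: mean distance to a known vector."""
    return sum(abs(a - b) for a, b in zip(w, TARGET)) / 3


weights, final_cost = calibrate(toy_cost)
```

Keeping the cost function as a plain argument mirrors the safety constraint that the search may vary the weights but never the metric itself.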

Safety constraints

The calibrator is LOCKED — the agent must not edit calibrate_signals.py. The synthetic corpus generator + MAD cost function live in one file by design; separation by structure prevents gaming.
No parameters inside the cost function: evaluate_mad() takes only a weight vector and does nothing but read signal_count × weight with per-item normalization.

Escalation strategy

| Experiments | Focus | Parameters |
|---|---|---|
| 1–30 | signal weights (Monte Carlo) | 3 parameters, random sampling |
| 31–50 | signal weights (Nelder-Mead refinement) | local descent around MC top-K |

CW Heuristic

Default weights for automatic Contribution Weight assignment based on GitHub signals.

Derived role labels (display layer only)

EDPA doesn't store roles per person; it derives them from the dominant signal type for UX (timesheets, reports). The engine math sees only cw values.

| Dominant signal | Display role |
|---|---|
| commit_author | owner |
| manual:* | key |
| pr_reviewer | reviewer |
| issue_comment | consulted |

Signal weights
| Signal | Score |
|---|---|
| commit_author | 4.00 |
| pr_reviewer | 2.17 |
| issue_comment | 1.46 |
| manual:* (/contribute) | explicit |

Rule: Signals are summed into contribution_score per (person, item). Per-item normalization gives cw = score / Σ_persons score. There is no "highest signal wins" and no threshold; even one comment yields a proportional share.

Monte Carlo calibration

1,000 synthetic scenarios (32,210 records); 5 candidates converged to MAD 0.0805 after Nelder-Mead refinement.

Strategic-role bias correction: EDPA addresses strategic-role bias purely via signal weight calibration:

  • BO/PM contributions show via issue_comment + manual /contribute directives. Calibrator boosts issue_comment weight if BO/PM are under-credited.
  • Arch contributions show via pr_reviewer. Calibrator boosts pr_reviewer weight similarly.
  • Dev/QA reference baseline — Git accurately captures their work.
  • Edge-case generator simulates pm_driven, pair_programmed, design_heavy, silent_reviewer patterns so calibration generalizes.

Note: The pre-pilot baseline comes from synthetic scenarios. After the kashealth pilot's first PI close, calibration will run against real ground truth (≥ 20 confirmed cw records).

Method comparison

| Criterion | EDPA v2.11.1 | Manual timesheets | Fixed allocation |
|---|---|---|---|
| Accuracy | High | Medium | Low |
| Effort | Minimal | High | None |
| Auditability | Full | Partial | None |
| Dual-view | Yes | No | No |
| Math. guarantee | Σ = capacity | None | Complex |
| Automation | GitHub Actions | Manual | Partial |

Demo calculation

Static demonstration of EDPA calculation for 3 people and 5 work items. Operational variant (Simple mode).

Capacity

| Person | FTE | Capacity (h) |
|---|---|---|
| Alice | 0.5 | 40h |
| Bob | 1.0 | 80h |
| Carol | 0.75 | 60h |

Work items & assignments (CW per-item normalized, Σ per row = 1.0)

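| Item | JS | Alice (CW) | Bob (CW) | Carol (CW) | Σ |
|---|---|---|---|---|---|
| S-101 | 5 | 0.70 | 0.30 | - | 1.00 |
| S-102 | 3 | 0.55 | 0.45 | - | 1.00 |
| S-103 | 8 | - | 0.65 | 0.35 | 1.00 |
| S-104 | 2 | - | 0.30 | 0.70 | 1.00 |
| S-105 | 5 | - | - | 1.00 | 1.00 |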

Score calculation (Score = JS × CW)

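| Item | JS | Alice Score | Bob Score | Carol Score |
|---|---|---|---|---|
| S-101 | 5 | 3.50 | 1.50 | - |
| S-102 | 3 | 1.65 | 1.35 | - |
| S-103 | 8 | - | 5.20 | 2.80 |
| S-104 | 2 | - | 0.60 | 1.40 |
| S-105 | 5 | - | - | 5.00 |
| Σ | | 5.15 | 8.65 | 9.20 |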

Derived Hours (DH = Score / ΣScore × Capacity)

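| Item | Alice (40h) | Bob (80h) | Carol (60h) |
|---|---|---|---|
| S-101 | 27.18h | 13.87h | - |
| S-102 | 12.82h | 12.49h | - |
| S-103 | - | 48.09h | 18.26h |
| S-104 | - | 5.55h | 9.13h |
| S-105 | - | - | 32.61h |
| Σ | 40.00h | 80.00h | 60.00h |
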
Alice: Σ = 40.00h, capacity 40h, VERIFIED
Bob: Σ = 80.00h, capacity 80h, VERIFIED
Carol: Σ = 60.00h, capacity 60h, VERIFIED

Σ DerivedHours[P, *] = Capacity[P, I] holds for every person. Always. No exceptions.
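
The demo figures can be recomputed with a short script. This is a sketch, not the engine: the function name `derived_hours` and the dictionary layout are assumptions, and the assignment of CW values to persons is the one that reproduces the Σ row of the score table and the Derived Hours table above.

```python
JOB_SIZE = {"S-101": 5, "S-102": 3, "S-103": 8, "S-104": 2, "S-105": 5}
CAPACITY = {"Alice": 40, "Bob": 80, "Carol": 60}
CW = {
    "S-101": {"Alice": 0.70, "Bob": 0.30},
    "S-102": {"Alice": 0.55, "Bob": 0.45},
    "S-103": {"Bob": 0.65, "Carol": 0.35},
    "S-104": {"Bob": 0.30, "Carol": 0.70},
    "S-105": {"Carol": 1.00},
}


def derived_hours(job_size, cw, capacity):
    """Return {person: {item: hours}} using Score = JS × CW."""
    # Score = JS × CW for each (person, item) pair.
    score = {person: {} for person in capacity}
    for item, shares in cw.items():
        for person, weight in shares.items():
            score[person][item] = job_size[item] * weight
    # DH = Score / ΣScore × Capacity, computed per person.
    hours = {}
    for person, items in score.items():
        total = sum(items.values())
        hours[person] = {i: s / total * capacity[person] for i, s in items.items()}
    return hours


hours = derived_hours(JOB_SIZE, CW, CAPACITY)
for person, items in hours.items():
    rounded = {item: round(h, 2) for item, h in items.items()}
    print(person, rounded, "sum =", round(sum(items.values()), 2))
```

Because each person's hours are that person's scores rescaled by their own capacity, the per-person sum equals the capacity by construction, up to floating-point error.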