Tests and evaluation of EDPA v2.11.1
300 verification checks from 105 scenarios (per-person scenarios run once for each of the 6 team members)
105 test scenarios across 8 categories. Per-person tests (invariants, dual-view, edge cases) run for each of the 6 team members: 39 per-person scenarios × 6 = 234 checks, plus 66 single-run scenarios → 300 total checks. All passing.
1. Mathematical Invariants (9 scenarios × 6 persons = 54 checks)
Fundamental mathematical guarantees of the EDPA engine — validated on every build. These tests verify that the model never breaks its key promises.
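The invariants in this section can be sketched as a single checking function. This is a minimal illustration, not the engine's actual code: the function name `check_invariants` and the dictionary layouts are assumptions; only the tolerances (0.01h and 0.001) and the invariant names come from the test list.

```python
HOURS_TOL = 0.01   # hours tolerance quoted for the capacity invariant
CW_TOL = 0.001     # tolerance quoted for per-item cw sums


def check_invariants(derived_hours, cw, capacity):
    """Evaluate the core invariants for one iteration.

    derived_hours: {person: {item: hours}}  (hypothetical layout)
    cw:            {item: {person: share}}  (hypothetical layout)
    capacity:      {person: hours}
    """
    checks = {
        # Persons with at least one item must sum to their capacity;
        # persons with no items simply carry 0h.
        "sum_equals_capacity": all(
            abs(sum(items.values()) - capacity[person]) <= HOURS_TOL
            for person, items in derived_hours.items()
            if items
        ),
        "no_negative_hours": all(
            hours >= 0
            for items in derived_hours.values()
            for hours in items.values()
        ),
        "per_item_cw_sums_to_one": all(
            abs(sum(shares.values()) - 1.0) <= CW_TOL for shares in cw.values()
        ),
    }
    # The summary flag is derived from the individual results, never set by hand.
    checks["invariant_ok"] = all(checks.values())
    return checks


report = check_invariants(
    derived_hours={"alice": {"S-101": 27.18, "S-102": 12.82}, "dave": {}},
    cw={"S-101": {"alice": 0.70, "bob": 0.30}},
    capacity={"alice": 40.0, "dave": 40.0},
)
print(report)
```

Deriving `invariant_ok` from the individual results mirrors test 08: the flag cannot disagree with the checks it summarizes.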
01 test_sum_equals_capacity PASS
Derived hours must match the declared capacity of the person.
Σ(hours) = capacity ± 0.01h
02 test_ratio_sum_equals_one PASS
Ratios must sum to 1.0 for every person with items.
Σ(ratio) = 1.0 ± 0.001
03 test_no_negative_hours PASS
No person may have negative derived hours.
All hours ≥ 0
04 test_no_negative_scores PASS
No score may be negative.
All scores ≥ 0
05 test_score_formula PASS
Score is calculated as JS multiplied by cw (per-item normalized share).
Score = JS × cw
06 test_per_item_cw_sums_to_one PASS
Per-item invariant: Σ cw across all persons on an item equals 1.0.
Σ_persons cw[*, item] = 1.0
07 test_capacity_invariant PASS
Per-person invariant: Σ derived hours equals declared capacity.
Σ_items DerivedHours[P, *] = Capacity[P, I] ± 0.01h
08 test_all_invariants_flag PASS
The invariant_ok flag must reflect the actual check results.
invariant_ok reflects actual checks
09 test_empty_items_no_crash PASS
A person with 0 items must get 0h, without crashing.
Person with 0 items → 0h, no crash
2. Evidence Detection (10 scenarios)
Verification of GitHub signal detection and additive aggregation into contribution_score. Signals add up; the resulting CW is the per-item normalized share (Σ across persons = 1.0 per item).
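Additive aggregation followed by per-item normalization can be sketched as below. The weights are the defaults quoted in this section; the function names and the input layout (a set of fired signal names per person, plus optional manual weights from /contribute) are assumptions made for the example.

```python
# Default signal weights quoted in this section.
SIGNAL_WEIGHTS = {"commit_author": 4.00, "pr_reviewer": 2.17, "issue_comment": 1.46}


def contribution_scores(fired, manual=None):
    """Sum the weights of fired signals per person on one item.

    fired:  {person: set of signal names}; a set models "many commits count
            as a single commit_author signal".
    manual: {person: weight} from /contribute directives, added on top.
    """
    manual = manual or {}
    scores = {}
    for person in sorted(set(fired) | set(manual)):
        score = sum(SIGNAL_WEIGHTS[name] for name in fired.get(person, ()))
        score += manual.get(person, 0.0)
        if score > 0:  # a person with no fired signal is not a contributor
            scores[person] = score
    return scores


def normalize(scores):
    """Per-item normalization: cw = score / sum of scores over persons."""
    total = sum(scores.values())
    return {person: s / total for person, s in scores.items()} if total else {}


fired = {
    "dev": {"commit_author", "pr_reviewer", "issue_comment"},
    "pm": {"issue_comment"},
    "idle": set(),
}
scores = contribution_scores(fired, manual={"pm": 0.6})
cw = normalize(scores)
print(scores, cw)
```

Here "dev" scores 4.00 + 2.17 + 1.46, "pm" scores 1.46 + 0.6, and "idle" drops out before normalization, so the remaining shares sum to 1.0.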
01 test_commit_author_signal PASS
Commit with S-XXX/F-XXX/E-XXX/I-XXX in branch/message adds signals.commit_author (default 4.00).
commit_author → +4.00 to score
02 test_pr_reviewer_signal PASS
Submitted PR review (excluding self) adds signals.pr_reviewer (default 2.17).
pr_reviewer → +2.17 to score
03 test_issue_comment_signal PASS
Issue/PR comment (bots excluded) adds signals.issue_comment (default 1.46).
issue_comment → +1.46 to score
04 test_signals_aggregate_additively PASS
When a person has commit + review + comment, the weights sum (4.00 + 2.17 + 1.46 = 7.63); there is no highest-signal-wins rule.
Σ fired signal_weights, not max
05 test_contribute_directive_additive PASS
/contribute @person weight:0.6 adds a manual:* signal with weight 0.6; it does not override auto-detection.
/contribute @user weight:X → +X to score
06 test_per_item_normalization PASS
After aggregation, CW is normalized per item: cw = score / Σ_persons score. Invariant: Σ cw[*, item] = 1.0.
Σ cw per item = 1.0 (±0.001)
07 test_branch_naming_extraction_S_F_E_I PASS
Branch regex extracts S-/F-/E-/I-XXX for Story, Feature, Epic, Initiative.
Detect: S-200, F-15, E-3, I-1
08 test_no_signals_no_contributor PASS
A person with no fired signal does not appear in contributors[] after aggregation.
score = 0 → excluded
09 test_bot_comments_filtered PASS
Comments from GitHub apps / bots do not count as issue_comment signals.
Bot login → signal does not fire
10 test_commit_count_no_time_effect PASS
Many commits by one author on an item count as a single commit_author signal (relevance, not volume).
1 commit = 10 commits in terms of score
3. CW Heuristics (15 scenarios)
Verification of heuristic signal weight correctness and rules for determining Contribution Weight. The heuristic must be consistent and reproducible.
01 test_signal_weight_ordering PASS
Hierarchy: commit_author > pr_reviewer > issue_comment.
sw.commit_author >= sw.pr_reviewer >= sw.issue_comment
02 test_per_item_cw_normalization PASS
Per-item cw share: Σ cw across persons = 1.0 per item.
Σ_persons cw[*, item] = 1.0 (engine invariant)
03 test_per_person_capacity_invariant PASS
Per-person hours invariant: Σ hours = capacity_per_iteration.
Σ_items DerivedHours[P, *] = Capacity[P, I]
04 test_no_role_overrides_in_heuristics PASS
Calibration runs on 3 signal weights.
cw_heuristics.yaml.role_overrides not present
05 test_signal_weights_commit PASS
Commit author signal must have signal weight 4.00 in cw_heuristics.yaml.
signals.commit_author = 4.00
06 test_signal_weights_pr_reviewer PASS
PR reviewer signal must have signal weight 2.17 in cw_heuristics.yaml.
signals.pr_reviewer = 2.17
07 test_signal_weights_issue_comment PASS
Issue comment signal must have signal weight 1.46 in cw_heuristics.yaml.
signals.issue_comment = 1.46
08 test_signals_aggregate_additively PASS
CW is computed by additive signal aggregation + per-item normalization. No highest-signal-wins.
cw[P, item] = Σ signal_weights / Σ_persons Σ signal_weights
09 test_contribute_directive_additive PASS
Manual /contribute @person weight:X adds a manual:* signal with weight X; it does not override auto-detection.
cw_after = (cw_auto_score + X) / Σ_persons score
10 test_per_item_invariant PASS
Σ cw across persons on a single item = 1.0 (engine invariant, ±0.001).
Σ_persons cw[*, item] = 1.0
11 test_cw_range_0_1 PASS
CW is a per-item normalized share, always in [0, 1.0]. No fixed per-signal floor/ceiling.
0 ≤ cw[P, item] ≤ 1.0
12 test_no_negative_contribution_score PASS
contribution_score sums positive signal weights; no signal subtracts.
contribution_score ≥ 0 always
13 test_per_item_independence PASS
A person's CW on item A is independent of CW on item B (per-item normalization).
cw[P, A] independent of cw[P, B]
14 test_per_person_independence PASS
Two persons' CW on the same item are independent (both come from their own signal aggregations).
cw[P1, item] independent of cw[P2, item]
15 test_strategic_role_via_signal_calibration PASS
PM/BO/Arch contribution is captured via issue_comment + pr_reviewer + manual /contribute. Per-role multipliers are not used; bias is handled by calibrating signal weights against ground truth.
No role_overrides in heuristics
4. Dual-View Consistency (12 scenarios × 6 persons = 72 checks)
EDPA provides two views — per-person and per-item. Both must be mutually consistent and sums must match in both directions.
01 test_per_person_sum_equals_capacity PASS
Per-person view: sum of derived hours = capacity for every person.
Σ DerivedHours[P, *] = Capacity[P]
02 test_per_item_shares_sum_100 PASS
Per-item view: sum of shares of all contributors = 100% for every item.
Σ shares[*, item] = 100%
03 test_same_cw_same_results_both_views PASS
Same CW must produce same results in both views.
per-person hours consistent with per-item shares
04 test_no_transitions_degenerates_to_done_credit PASS
When git history records no Feature/Epic/Initiative transitions, the engine credits only Story Done items.
gate_events empty → only Story Done credit fires
05 test_per_person_hours_sum_cross_items PASS
Per-person: hours on item X + hours on all other items = total capacity.
hours[P, X] + hours[P, rest] = capacity[P]
06 test_zero_contribution_excluded_both_views PASS
Items with zero contribution do not appear in either view.
zero contribution → absent in both views
07 test_single_contributor_full_share PASS
Single contributor on an item gets 100% share in per-item view.
single contributor → 100% share
08 test_two_equal_contributors_equal_split PASS
Two contributors with equal CW get 50/50 split in per-item view.
equal CW → 50/50 share split
09 test_capacity_no_affect_per_item_share PASS
Different capacities do not affect percentage share in per-item view.
capacity[P1] ≠ capacity[P2] → share% unchanged
10 test_cross_check_hours_vs_capacities PASS
Cross-check: sum of all per-item hours across all items ≤ sum of all capacities.
ΣΣ hours[P, item] ≤ Σ capacity[P]
11 test_three_contributors_weighted_split PASS
Three contributors with CW 1.0, 0.6, 0.25: shares match the weight ratio.
1.0:0.6:0.25 → 54%:32%:14% share
12 test_per_item_hours_sum_matches_js_proportion PASS
Sum of hours on an item from all persons matches the Job Size proportion in total budget.
item hours reflect JS weight in total budget
5. Edge Cases (18 scenarios × 6 persons = 108 checks)
Boundary cases and extreme scenarios that the EDPA engine must handle without crashing, with correct results and no precision loss.
01 test_person_zero_relevant_items PASS
Person with 0 relevant items must get 0 hours without crashing.
0 items → 0h, no crash
02 test_person_single_item_full_capacity PASS
Person with one item must get full capacity.
1 item → hours = capacity
03 test_all_items_same_job_size PASS
All items with the same Job Size: hours are distributed only by CW.
same JS → distribution by CW only
04 test_all_people_same_cw_on_item PASS
All people with the same CW on an item: hours are proportional to capacity.
same CW → hours proportional to capacity
05 test_job_size_zero_excluded PASS
Item with Job Size = 0 must be excluded from calculation (no division by zero).
JS = 0 → item excluded, no division by zero
06 test_single_person_team PASS
Single-person team: person gets full capacity regardless of CW.
single person → full capacity
07 test_hundred_items_capacity_sum PASS
100 items for one person: capacity must still sum correctly.
100 items: Σ hours = capacity
08 test_max_job_size_allocation PASS
Maximum Job Size (20) must produce correct proportional allocation.
JS = 20 → correct proportional allocation
09 test_min_job_size_allocation PASS
Minimum Job Size (1) must produce correct proportional allocation.
JS = 1 → correct proportional allocation
10 test_all_cw_equal_distribution PASS
All CW = 1.0: hours are distributed equally by Job Size.
all CW = 1.0 → equal distribution per JS
11 test_very_unequal_capacities PASS
Very unequal capacities (10h vs 160h): each person sums to their own capacity.
10h + 160h: each sums to own capacity
12 test_floating_point_precision PASS
Float precision: sum must be within 0.01h tolerance of capacity.
Σ within 0.01h tolerance
13 test_unicode_item_titles PASS
Unicode characters in item titles must not cause processing errors.
Unicode titles → no processing errors
14 test_empty_iteration_graceful PASS
Empty iteration (no stories) must be handled without crashing.
empty iteration → graceful handling
15 test_person_only_epic_feature PASS
Person only on Epic/Feature (no stories) must still get allocation.
Epic/Feature only → still gets allocation
16 test_negative_job_size_rejected PASS
Negative Job Size must be rejected; no negative allocation.
JS < 0 → item rejected
17 test_duplicate_person_on_item_no_double_count PASS
Duplicate signals from the same person on an item must not double the allocation.
duplicate signals → single CW entry
18 test_large_team_scaling PASS
20+ people in a team: calculation still converges and invariants hold.
20+ people: all invariants hold
6. Auto-calibration (12 scenarios)
Verification of the auto-calibration system inspired by Karpathy's autoresearch pattern. Calibration must be safe, reproducible, and efficient.
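The metric and the guard rails checked in this section can be sketched in a few lines. The field names, the 20-record minimum, and the "keep on better or equal" rule come from the test list; the function names and the list-of-dicts record layout are assumptions for illustration.

```python
REQUIRED_FIELDS = ("item_id", "person_id", "evidence_role", "auto_cw", "confirmed_cw")
MIN_RECORDS = 20  # minimum number of manually confirmed CW records


def validate_ground_truth(records):
    """Raise ValueError if the ground truth is too small or malformed."""
    if len(records) < MIN_RECORDS:
        raise ValueError(f"need at least {MIN_RECORDS} records, got {len(records)}")
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")


def mad(records):
    """MAD = mean(|auto_cw - confirmed_cw|); lower is better."""
    return sum(abs(r["auto_cw"] - r["confirmed_cw"]) for r in records) / len(records)


def keep_experiment(baseline_mad, candidate_mad):
    """Keep the experiment when MAD improves or stays the same, else revert."""
    return candidate_mad <= baseline_mad


records = [
    {"item_id": f"S-{i}", "person_id": "p1", "evidence_role": "owner",
     "auto_cw": 0.5, "confirmed_cw": 0.4}
    for i in range(20)
]
validate_ground_truth(records)
print(f"MAD = {mad(records):.3f}")
```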
01 test_minimum_ground_truth_records PASS
Calibration requires a minimum of 20 manually confirmed CW records.
len(ground_truth) ≥ 20 required
02 test_mad_computation_correctness PASS
MAD (Mean Absolute Deviation) is correctly computed as mean of |auto_cw - confirmed_cw|.
MAD = mean(|auto_cw - confirmed_cw|)
03 test_lower_mad_better PASS
Lower MAD = better heuristic. Optimization direction must be "lower is better".
direction: lower is better
04 test_calibrator_locked PASS
Calibrator (calibrate_signals.py) is locked: synthetic corpus + MAD cost function live in one file. The agent must not modify it, to prevent gaming of the metric.
calibrate_signals.py: LOCKED, read-only
05 test_single_change_per_iteration PASS
Each experiment changes only one parameter, which isolates the effect of each change.
one parameter change per experiment
06 test_git_commit_after_experiment PASS
A git commit is made after each experiment; the git log serves as the memory of the loop.
git commit after each experiment
07 test_revert_on_worse_mad PASS
When MAD worsens, the experiment is reverted (git reset --hard HEAD~1).
MAD worse → experiment commit dropped (git reset --hard HEAD~1)
08 test_keep_on_better_or_equal_mad PASS
When MAD improves or stays the same, the experiment is kept.
MAD better/equal → keep commit
09 test_budget_50_experiments_max PASS
Maximum budget is 50 experiments, as protection against infinite loops.
budget ≤ 50 experiments
10 test_expected_improvement_range PASS
Expected improvement is 15–30% MAD reduction after 50 experiments.
expected: 15-30% MAD reduction
11 test_ground_truth_format_validation PASS
Ground truth records must contain: item_id, person_id, evidence_role, auto_cw, confirmed_cw.
required fields: item_id, person_id, evidence_role, auto_cw, confirmed_cw
12 test_no_data_leakage PASS
No data leakage between training and validation sets (strict separation).
no data leakage between train/validation
7. Governance & Audit (17 scenarios)
Verification of audit trail, freeze rules, governance processes, and compliance requirements. EDPA must be fully auditable and reproducible.
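Three of the governance rules in this section are mechanical enough to sketch: the branch naming convention, the WSJF formula, and the Job Size guardrails. The regular expression below is one possible reading of {type}/{ITEM-ID}-description and is an assumption (the document does not show the real pattern), as are the function names and the lowercase-only type prefix.

```python
import re

# Hypothetical pattern for {type}/{ITEM-ID}-description, e.g. feature/S-200-login-form.
BRANCH_RE = re.compile(
    r"^(?P<type>[a-z]+)/(?P<item_id>[SFEI]-\d+)-(?P<description>[A-Za-z0-9._-]+)$"
)

# Upper Job Size limits from the guardrail tests: Story 8, Feature 13, Epic 20.
MAX_JOB_SIZE = {"S": 8, "F": 13, "E": 20}


def parse_branch(name):
    """Return the branch parts as a dict, or None if the name breaks the convention."""
    match = BRANCH_RE.match(name)
    return match.groupdict() if match else None


def wsjf(bv, tc, rr_oe, job_size):
    """WSJF = (BV + TC + RR&OE) / JS."""
    if job_size <= 0:
        raise ValueError("job size must be positive")
    return (bv + tc + rr_oe) / job_size


def job_size_within_guardrail(item_id, job_size):
    """Check the Job Size limit for the item level encoded in the ID prefix."""
    limit = MAX_JOB_SIZE.get(item_id[0])
    return limit is None or job_size <= limit


print(parse_branch("feature/S-200-login-form"))
print(wsjf(bv=8, tc=5, rr_oe=3, job_size=4))
print(job_size_within_guardrail("S-200", 13))
```

Initiative IDs (I-XXX) pass the guardrail check unconditionally here because the section states no limit for them.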
01 test_snapshot_frozen_after_close PASS
Snapshot is frozen after Iteration Close — must not be modified.
snapshot.frozen = true after close
02 test_frozen_snapshot_immutable PASS
Frozen snapshot must not be modified in-place.
frozen snapshot: no in-place modification
03 test_corrections_create_new_revision PASS
Corrections create a new revision (_rev2, _rev3), never overwrite the original.
correction → new revision (_rev2, _rev3)
04 test_snapshot_required_fields PASS
Snapshot must contain all 10 required top-level keys.
10 required keys present in snapshot
05 test_branch_naming_enforced PASS
Branch naming convention: {type}/{ITEM-ID}-description must be enforced.
branch: {type}/{ITEM-ID}-description
06 test_pr_references_work_item PASS
PR must reference a work item (S-XXX, F-XXX, E-XXX) in title or body.
PR references: S-XXX, F-XXX, or E-XXX
07 test_traceability_chain PASS
Full traceability chain: Initiative → Epic → Feature → Story → PR → Commit.
Initiative → Epic → Feature → Story → PR → Commit
08 test_wsjf_calculation PASS
WSJF is correctly calculated as (BV + TC + RR&OE) / JS.
WSJF = (BV + TC + RR&OE) / JS
09 test_job_size_guardrails_story PASS
Job Size guardrails for Story: JS ≤ 8 (ideally ≤ 5).
Story JS ≤ 8 (recommended ≤ 5)
10 test_job_size_guardrails_feature PASS
Job Size guardrails for Feature: JS ≤ 13.
Feature JS ≤ 13
11 test_job_size_guardrails_epic PASS
Job Size guardrails for Epic: JS ≤ 20.
Epic JS ≤ 20
12 test_dor_checklist_validation PASS
Definition of Ready checklist: description, AC, estimate, parent linked.
DoR: description, AC, estimate, parent linked
13 test_dod_checklist_validation PASS
Definition of Done checklist: code reviewed, tests passed, PR merged.
DoD: code reviewed, tests passed, PR merged
14 test_wip_limit_enforcement PASS
WIP limit: ideally 1 Story per person at any given time.
WIP limit: 1 Story per person (ideal)
15 test_bankid_signature_support PASS
BankID electronic signature support (Act 21/2020 Coll.).
BankID signature: Act 21/2020 Coll.
16 test_reproducible_calculation PASS
Reproducible calculation: same inputs must always produce same outputs.
same inputs → same outputs (deterministic)
17 test_audit_trail_five_pillars PASS
Audit trail covers 5 pillars: GitHub evidence, capacity, snapshot, reproducible calculation, signature.
5 pillars: evidence, capacity, snapshot, calc, signature
8. Capacity Planning (12 scenarios)
Verification of the Iteration Planning Protocol — planning_factor as a team-level property, 80% rule, buffer usage tracking, and capacity commitment workflow.
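The planning formulas tested in this section reduce to a few lines of arithmetic. The sketch below uses hypothetical function names; returning None when a team plans to 100% (and therefore has no buffer) is an assumption, since the document does not say how that case is reported. The warning check looks at a single iteration, whereas the text speaks of consistently high usage.

```python
def planning_capacity(capacities, planning_factor=0.8):
    """Planning_Capacity = sum of Capacity[P] x planning_factor (team level)."""
    if not 0 < planning_factor <= 1.0:
        raise ValueError("planning_factor must be in (0, 1.0]")
    return sum(capacities.values()) * planning_factor


def buffer_usage(unplanned_hours, capacities, planning_factor=0.8):
    """Buffer_Usage = unplanned / (Total - Planning) x 100%.

    Returns None when there is no buffer (assumption, see lead-in).
    """
    buffer = sum(capacities.values()) - planning_capacity(capacities, planning_factor)
    if buffer <= 0:
        return None
    return unplanned_hours / buffer * 100


def needs_warning(usage_percent, threshold=90.0):
    """Flag buffer usage above the 90% threshold named in the tests."""
    return usage_percent is not None and usage_percent > threshold


team = {"Alice": 40, "Bob": 80, "Carol": 60}  # total capacity 180h
print(planning_capacity(team))       # plan to 80% of 180h
print(buffer_usage(18, team))        # 18h of unplanned work against a 36h buffer
print(needs_warning(buffer_usage(35, team)))
```

Note that neither function feeds back into the EDPA allocation itself: DerivedHours keeps using the full Capacity[P].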
01 test_planning_factor_team_level PASS
planning_factor must be a team-level property, not a cadence or person-level property.
teams[].planning_factor (not cadence, not person)
02 test_planning_factor_default PASS
Default planning_factor must be 0.8 (plan to 80% of total capacity).
planning_factor default = 0.8
03 test_planning_factor_range PASS
planning_factor must be in range (0, 1.0]: never zero, never above 100%.
0 < planning_factor ≤ 1.0
04 test_planning_capacity_formula PASS
Planning Capacity = Total Capacity × planning_factor for each team.
Planning_Capacity = Σ Capacity[P] × planning_factor
05 test_different_teams_different_factors PASS
Different teams may have different planning_factor values.
teams[A].planning_factor ≠ teams[B].planning_factor allowed
06 test_edpa_uses_total_not_planning PASS
EDPA calculation always uses Total Capacity (100%), not Planning Capacity.
DerivedHours uses Capacity[P], not Planning_Capacity
07 test_buffer_absorbs_unplanned PASS
Buffer (20% by default) absorbs support, maintenance, incidents, and unplanned work.
buffer = Total - Planning → unplanned work
08 test_unplanned_items_generate_evidence PASS
Unplanned items in the buffer generate evidence and are allocated normally by EDPA.
unplanned items → evidence → normal EDPA allocation
09 test_capacity_confirmed_at_planning PASS
Each team member must confirm availability at Iteration Planning (availability: confirmed).
availability = confirmed required
10 test_planning_factor_no_affect_invariant PASS
planning_factor must not affect the mathematical guarantee Σ DerivedHours = Capacity.
planning_factor → no effect on Σ = Capacity
11 test_buffer_usage_metric PASS
Buffer_Usage metric tracks how much of the reserve was consumed by unplanned work.
Buffer_Usage = unplanned / (Total - Planning) × 100%
12 test_high_buffer_usage_warning PASS
Consistently high buffer usage (>90%) should trigger a warning to adjust capacity or scope.
Buffer_Usage > 90% → warning
Auto-calibration (Karpathy loop)
Automatic calibration system inspired by Karpathy's autoresearch pattern. One file, one metric, one loop.
| Target | cw_heuristics.yaml.tmpl (signals: block) |
| Metric | MAD (Mean Absolute Deviation) on the MC corpus |
| Direction | lower is better |
| Budget | 2000 MC samples + coordinate descent on top-5 (~10s) |
| Search space | 3D signal weights, each in [0.1, 8.0] |
| Calibrator | calibrate_signals.py (LOCKED) |
- Typical improvement: 15–30% MAD reduction
- After 50 experiments: heuristic matches real team patterns
- Diminishing returns after ~30 experiments
- Prerequisite: ≥ 20 manually confirmed CW records
Loop
- Run python calibrate_signals.py --scenarios 1000 --seed 42
- The script will:
  - Generate a synthetic Monte Carlo corpus (1,000 scenarios, ~31k records) procedurally from a model where signal counts probabilistically reflect each person's true cw share
  - Compute baseline MAD against shipped defaults (commit_author=4.00, pr_reviewer=2.17, issue_comment=1.46)
  - Phase 1, random sampling: 2000 random weight vectors in the 3D space [0.1, 8.0], sorted by MAD
  - Phase 2, coordinate descent: refines the top-5 candidates, tries ±step per signal, halves step on no-improvement
  - Return best calibrated weights + MAD improvement %
  - With --apply, rewrite cw_heuristics.yaml.tmpl + refresh the calibration: metadata
  - Print summary: baseline MAD, calibrated MAD, % improvement, top weights
  - (Optional) --report report.json dumps the full run for audit
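The two search phases above can be sketched on a toy problem. This is not the code of calibrate_signals.py: the hand-made three-item corpus, the function names, and the reduced sample count are all assumptions, and the refinement shown is the coordinate descent described in the steps (the document also mentions Nelder-Mead elsewhere, which is not shown here).

```python
import random

SIGNALS = ("commit_author", "pr_reviewer", "issue_comment")
LOW, HIGH = 0.1, 8.0  # search bounds quoted in the text


def score(counts, weights):
    """Contribution score: sum of signal_count x weight."""
    return sum(c * w for c, w in zip(counts, weights))


def toy_corpus(true_weights=(3.0, 2.0, 1.0)):
    """Hypothetical corpus: each item maps person -> (signal counts, true cw)."""
    patterns = [
        {"dev": (1, 0, 0), "arch": (0, 1, 0)},
        {"dev": (1, 1, 0), "pm": (0, 0, 1)},
        {"dev": (1, 0, 1), "arch": (0, 1, 1), "pm": (0, 0, 1)},
    ]
    corpus = []
    for item in patterns:
        total = sum(score(c, true_weights) for c in item.values())
        corpus.append({p: (c, score(c, true_weights) / total) for p, c in item.items()})
    return corpus


def evaluate_mad(weights, corpus):
    """Mean absolute deviation between per-item normalized cw and true cw."""
    errors = []
    for item in corpus:
        total = sum(score(c, weights) for c, _ in item.values())
        for counts, true_cw in item.values():
            errors.append(abs(score(counts, weights) / total - true_cw))
    return sum(errors) / len(errors)


def random_search(corpus, samples, rng):
    """Phase 1: evaluate random weight vectors, lowest MAD first."""
    candidates = []
    for _ in range(samples):
        weights = tuple(rng.uniform(LOW, HIGH) for _ in SIGNALS)
        candidates.append((evaluate_mad(weights, corpus), weights))
    return sorted(candidates)


def coordinate_descent(weights, corpus, step=1.0, min_step=0.01):
    """Phase 2: try +/-step per signal, halve the step when nothing improves."""
    best, best_mad = list(weights), evaluate_mad(weights, corpus)
    while step >= min_step:
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] = min(HIGH, max(LOW, trial[i] + delta))
                trial_mad = evaluate_mad(trial, corpus)
                if trial_mad < best_mad - 1e-12:  # require a real improvement
                    best, best_mad, improved = trial, trial_mad, True
        if not improved:
            step /= 2
    return tuple(best), best_mad


def calibrate(corpus, samples=200, top_k=5, seed=42):
    """Run both phases and return (best weights, best MAD)."""
    rng = random.Random(seed)
    ranked = random_search(corpus, samples, rng)
    refined = [coordinate_descent(w, corpus) for _mad, w in ranked[:top_k]]
    return min(refined, key=lambda result: result[1])


corpus = toy_corpus()
baseline = evaluate_mad((4.00, 2.17, 1.46), corpus)
best_weights, best_mad = calibrate(corpus)
print(f"baseline MAD {baseline:.4f}, calibrated MAD {best_mad:.4f}")
```

Because cw is a normalized share, only the ratios between the weights matter: any positive multiple of the true weights gives a MAD of zero on this toy corpus.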
Safety constraints
- calibrate_signals.py: the synthetic corpus generator + MAD cost function live in one file by design; separation by structure prevents gaming.
- evaluate_mad() takes only a weight vector and pure-reads signal_count × weight with per-item normalization.
Escalation strategy
| Experiments | Focus | Parameters |
|---|---|---|
| 1–30 | signal weights (Monte Carlo) | 3 parameters, random sampling |
| 31–50 | signal weights (Nelder-Mead refinement) | local descent around MC top-K |
CW Heuristic
Default weights for automatic Contribution Weight assignment based on GitHub signals.
EDPA doesn't store roles per person; it derives them from the dominant signal type for UX (timesheets, reports). The engine math sees only cw values.
| Dominant signal | Display role |
|---|---|
| commit_author | owner |
| manual:* | key |
| pr_reviewer | reviewer |
| issue_comment | consulted |
| Signal | Score |
|---|---|
| commit_author | 4.00 |
| pr_reviewer | 2.17 |
| issue_comment | 1.46 |
| manual:* (/contribute) | explicit |
Fired signal weights are summed into contribution_score per (person, item). Per-item normalization gives cw = score / Σ_persons score. There is no "highest signal wins" rule and no threshold: even one comment yields a proportional share.
Calibration corpus: 1,000 synthetic scenarios (32,210 records); 5 candidates converged to MAD 0.0805 after Nelder-Mead refinement.
Strategic-role bias correction: EDPA addresses strategic-role bias purely via signal weight calibration:
- BO/PM contributions show via issue_comment + manual /contribute directives. Calibrator boosts issue_comment weight if BO/PM are under-credited.
- Arch contributions show via pr_reviewer. Calibrator boosts pr_reviewer weight similarly.
- Dev/QA are the reference baseline: Git accurately captures their work.
- Edge-case generator simulates pm_driven, pair_programmed, design_heavy, silent_reviewer patterns so calibration generalizes.
Note: The pre-pilot baseline comes from synthetic scenarios. After the kashealth pilot's first PI close, calibration runs against real ground truth (≥ 20 confirmed cw records).
Method comparison
| Criterion | EDPA v2.11.1 | Manual timesheets | Fixed allocation |
|---|---|---|---|
| Accuracy | High | Medium | Low |
| Effort | Minimal | High | None |
| Auditability | Full | Partial | None |
| Dual-view | Yes | No | No |
| Math. guarantee | Σ = capacity | None | Complex |
| Automation | GitHub Actions | Manual | Partial |
Demo calculation
Static demonstration of EDPA calculation for 3 people and 5 work items. Operational variant (Simple mode).
Capacity
| Person | FTE | Capacity (h) |
|---|---|---|
| Alice | 0.5 | 40h |
| Bob | 1.0 | 80h |
| Carol | 0.75 | 60h |
Work items & assignments (CW per-item normalized, Σ per row = 1.0)
| Item | JS | Alice (CW) | Bob (CW) | Carol (CW) | Σ |
|---|---|---|---|---|---|
| S-101 | 5 | 0.70 | 0.30 | — | 1.00 |
| S-102 | 3 | 0.55 | 0.45 | — | 1.00 |
| S-103 | 8 | — | 0.65 | 0.35 | 1.00 |
| S-104 | 2 | — | 0.30 | 0.70 | 1.00 |
| S-105 | 5 | — | — | 1.00 | 1.00 |
Score calculation (Score = JS × CW)
| Item | JS | Alice Score | Bob Score | Carol Score |
|---|---|---|---|---|
| S-101 | 5 | 3.50 | 1.50 | — |
| S-102 | 3 | 1.65 | 1.35 | — |
| S-103 | 8 | — | 5.20 | 2.80 |
| S-104 | 2 | — | 0.60 | 1.40 |
| S-105 | 5 | — | — | 5.00 |
| Σ | | 5.15 | 8.65 | 9.20 |
Derived Hours (DH = Score / ΣScore × Capacity)
| Item | Alice (40h) | Bob (80h) | Carol (60h) |
|---|---|---|---|
| S-101 | 27.18h | 13.87h | — |
| S-102 | 12.82h | 12.49h | — |
| S-103 | — | 48.09h | 18.26h |
| S-104 | — | 5.55h | 9.13h |
| S-105 | — | — | 32.61h |
| Σ | 40.00h | 80.00h | 60.00h |
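The demo tables can be reproduced with a short script. The capacities, Job Sizes, and CW values are copied from the tables above; the function name `derived_hours` and the dictionary layout are assumptions for the example, not the engine's API.

```python
CAPACITY = {"Alice": 40.0, "Bob": 80.0, "Carol": 60.0}
JOB_SIZE = {"S-101": 5, "S-102": 3, "S-103": 8, "S-104": 2, "S-105": 5}
CW = {  # per-item normalized shares, each row sums to 1.0
    "S-101": {"Alice": 0.70, "Bob": 0.30},
    "S-102": {"Alice": 0.55, "Bob": 0.45},
    "S-103": {"Bob": 0.65, "Carol": 0.35},
    "S-104": {"Bob": 0.30, "Carol": 0.70},
    "S-105": {"Carol": 1.00},
}


def derived_hours(job_size, cw, capacity):
    """DH = Score / sum(Score) x Capacity, with Score = JS x CW, per person."""
    hours = {}
    for person, cap in capacity.items():
        scores = {
            item: job_size[item] * shares[person]
            for item, shares in cw.items()
            if person in shares
        }
        total = sum(scores.values())
        # A person with no scored items gets an empty allocation (0h).
        hours[person] = (
            {item: cap * s / total for item, s in scores.items()} if total else {}
        )
    return hours


hours = derived_hours(JOB_SIZE, CW, CAPACITY)
for person, items in hours.items():
    row = ", ".join(f"{item}={h:.2f}h" for item, h in items.items())
    print(f"{person}: {row} (total {sum(items.values()):.2f}h)")
```

Each person's scores are normalized against that person's own score total, which is why every column sums to the declared capacity regardless of how many items the person touched.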