Tests and evaluation of EDPA v1.0.0
314 verification checks from 114 scenarios (per-person scenarios run for each of 6 team members)
114 test scenarios across 8 categories. Per-person tests (invariants, dual-view, edge cases) run for each of the 6 team members, giving 60 + 72 + 108 = 240 checks; the remaining 74 scenarios run once, for 314 checks in total. All passing.
1. Mathematical Invariants: 10 scenarios × 6 persons = 60 checks
Fundamental mathematical guarantees of the EDPA engine — validated on every build. These tests verify that the model never breaks its key promises.
01 test_sum_equals_capacity PASS
Derived hours must match the declared capacity of the person.
Σ(hours) = capacity ± 0.01h
02 test_ratio_sum_equals_one PASS
Ratios must sum to 1.0 for every person with items.
Σ(ratio) = 1.0 ± 0.001
03 test_no_negative_hours PASS
No person may have negative derived hours.
All hours ≥ 0
04 test_no_negative_scores PASS
No score may be negative.
All scores ≥ 0
05 test_score_formula_simple PASS
In simple mode, Score is JS multiplied by CW.
Score = JS × CW
06 test_score_formula_full PASS
In full mode, Score is JS multiplied by CW multiplied by RS.
Score = JS × CW × RS
07 test_full_mode_invariants PASS
Full mode also guarantees that the sum of derived hours equals capacity.
Full mode: Σ(hours) = capacity ± 0.01h
08 test_all_invariants_flag PASS
The invariant_ok flag must reflect the actual check results.
invariant_ok reflects actual checks
09 test_empty_items_no_crash PASS
A person with 0 items must get 0h, without crashing.
Person with 0 items → 0h, no crash
10 test_cw_ordering PASS
CW must preserve ordering: owner ≥ key ≥ reviewer ≥ consulted.
owner ≥ key ≥ reviewer ≥ consulted
2. Evidence Detection: 15 scenarios
Verification of correct GitHub signal detection and their mapping to Evidence Score and Contribution Weight. Each signal must be correctly identified and scored.
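Several checks in this category concern extracting a work-item reference (S-, F-, or E- followed by digits) from a branch name. A minimal sketch of that extraction, assuming a single combined regular expression; the names `WORK_ITEM_RE` and `extract_work_item` are illustrative and not taken from the EDPA code base:

```python
import re
from typing import Optional

# Assumed pattern: Story (S-), Feature (F-) or Epic (E-) followed by digits.
WORK_ITEM_RE = re.compile(r"\b([SFE]-\d+)\b")


def extract_work_item(branch: str) -> Optional[str]:
    """Return the first work-item reference found in a branch name, or None."""
    match = WORK_ITEM_RE.search(branch)
    return match.group(1) if match else None


for branch in ("feature/S-200-omop-parser", "feature/F-15-auth-module", "epic/E-3-platform"):
    print(branch, "->", extract_work_item(branch))
```

A branch without a matching reference returns `None`, which a caller could treat as "no signal" for that branch.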
01 test_assignee_detection_cw PASS
Issue assignee must be detected as owner with CW = 1.0.
assignee signal → CW = 1.0 (owner)
02 test_pr_author_without_assignee PASS
PR author without assignee role must get CW = 0.6 (key contributor).
pr_author signal → CW = 0.6 (key)
03 test_commit_author_only PASS
Person with only a commit (no assignee/PR) gets CW = 0.25.
commit_author signal → CW = 0.25 (reviewer)
04 test_pr_reviewer_detection PASS
PR reviewer must be detected with CW = 0.25 (reviewer role).
pr_reviewer signal → CW = 0.25 (reviewer)
05 test_issue_comment_only PASS
Person with only an issue comment gets CW = 0.15 (consulted).
issue_comment signal → CW = 0.15 (consulted)
06 test_multiple_signals_highest_wins PASS
With multiple signals (assignee + commit) the strongest signal wins.
assignee + commit → CW = 1.0 (highest wins)
07 test_contribute_command_detection PASS
/contribute command in issue body must be detected with CW = 0.6.
/contribute @user → CW = 0.6 (key)
08 test_contribute_weight_override PASS
/contribute with explicit weight overrides automatically detected CW.
/contribute @user weight:0.8 → CW = 0.8
09 test_branch_naming_story_extraction PASS
Branch feature/S-200-omop-parser must extract reference S-200.
Branch regex: S-\d+ → S-200
10 test_branch_naming_feature_extraction PASS
Branch feature/F-15-auth-module must extract reference F-15.
Branch regex: F-\d+ → F-15
11 test_branch_naming_epic_extraction PASS
Branch epic/E-3-platform must extract reference E-3.
Branch regex: E-\d+ → E-3
12 test_no_matching_signals_excluded PASS
Person with no signals on an item must not be assigned.
No signals → person excluded from item
13 test_evidence_score_threshold PASS
Evidence Score below threshold (< 1.0) causes person exclusion from item.
ES < threshold → excluded
14 test_in_progress_items_excluded PASS
Items in In-Progress status are not included in iteration calculation.
Status: In-Progress → excluded from calculation
15 test_commit_count_no_time_effect PASS
Commit count does not affect time (only relevance). 1 commit = 10 commits for CW.
commit_count independent of time allocation
3. CW Heuristics: 18 scenarios
Verification of heuristic weight correctness and rules for determining Contribution Weight. The heuristic must be consistent and reproducible.
01 test_default_role_weight_owner PASS
Default weight for owner role must be 1.0.
role_weights.owner = 1.0
02 test_default_role_weight_key PASS
Default weight for key role must be 0.6.
role_weights.key = 0.6
03 test_default_role_weight_reviewer PASS
Default weight for reviewer role must be 0.25.
role_weights.reviewer = 0.25
04 test_default_role_weight_consulted PASS
Default weight for consulted role must be 0.15.
role_weights.consulted = 0.15
05 test_signal_weights_assignee PASS
Assignee signal must have Evidence Score +4.0.
signals.assignee = 4.0
06 test_signal_weights_contribute PASS
Contribute_command signal must have Evidence Score +3.0.
signals.contribute_command = 3.0
07 test_signal_weights_pr_author PASS
PR author signal must have Evidence Score +2.0.
signals.pr_author = 2.0
08 test_signal_weights_commit PASS
Commit author signal must have Evidence Score +1.0.
signals.commit_author = 1.0
09 test_highest_signal_determines_cw PASS
CW is determined by the strongest signal, not the sum; signals are never summed for CW.
CW = role_weights[highest_signal], no summing
10 test_manual_override_precedence PASS
Manual /contribute override must take precedence over auto-detection.
manual_cw ≠ null → use manual_cw
11 test_cw_strict_ordering PASS
CW ordering must be strict: owner ≥ key ≥ reviewer ≥ consulted.
1.0 ≥ 0.6 ≥ 0.25 ≥ 0.15 (strict)
12 test_cw_minimum_floor PASS
Minimum CW is 0.15 (consulted floor); no person may have a lower CW.
CW ≥ 0.15 (consulted floor)
13 test_cw_maximum_ceiling PASS
Maximum CW is 1.0 (owner ceiling); no automatic weight exceeds 1.0.
CW ≤ 1.0 (owner ceiling)
14 test_rs_normalization PASS
Relevance Signal is normalized: RS = min(ES / max_ES, 1.0).
RS = min(ES / max_ES, 1.0)
15 test_rs_range_validation PASS
RS must be in range 0 to 1.0: never negative, never greater than 1.
0 ≤ RS ≤ 1.0
16 test_multiple_people_same_item PASS
Multiple people on the same item must have independent CW for each person.
CW[P1, item] independent of CW[P2, item]
17 test_same_person_multiple_items PASS
Same person on multiple items must have independent CW for each item.
CW[P, item1] independent of CW[P, item2]
18 test_architecture_role_detection PASS
Architecture/PM role detected via comments + /contribute command.
comment + /contribute → key/consulted role
4. Dual-View Consistency: 12 scenarios × 6 persons = 72 checks
EDPA provides two views — per-person and per-item. Both must be mutually consistent and sums must match in both directions.
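For the per-item view, the checks in this category state that shares follow the CW ratio and do not depend on capacity. A minimal sketch under the assumption that a person's share of an item is their CW divided by the sum of CWs on that item; `per_item_shares` is a hypothetical helper name:

```python
def per_item_shares(cw_by_person: dict[str, float]) -> dict[str, float]:
    """Percentage share of one item per contributor, proportional to CW.

    Assumption: share = CW / sum of CW on the item. Capacity is not an input,
    so different capacities cannot change the result.
    """
    total = sum(cw_by_person.values())
    if total == 0:
        return {}  # zero contribution: the item has no shares to report
    return {person: 100.0 * cw / total for person, cw in cw_by_person.items()}


shares = per_item_shares({"owner": 1.0, "key": 0.6, "reviewer": 0.25})
print({person: round(share) for person, share in shares.items()})
```

With weights 1.0, 0.6, and 0.25 the rounded shares come out as 54%, 32%, and 14%, matching the three-contributor scenario in this category.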
01 test_per_person_sum_equals_capacity PASS
Per-person view: sum of derived hours = capacity for every person.
Σ DerivedHours[P, *] = Capacity[P]
02 test_per_item_shares_sum_100 PASS
Per-item view: sum of shares of all contributors = 100% for every item.
Σ shares[*, item] = 100%
03 test_same_cw_same_results_both_views PASS
Same CW must produce same results in both views.
per-person hours consistent with per-item shares
04 test_mode_switch_preserves_guarantee PASS
Switching mode simple → full preserves the Σ = Capacity guarantee.
simple → full: Σ = Capacity still holds
05 test_per_person_hours_sum_cross_items PASS
Per-person: hours on item X + hours on all other items = total capacity.
hours[P, X] + hours[P, rest] = capacity[P]
06 test_zero_contribution_excluded_both_views PASS
Items with zero contribution do not appear in either view.
zero contribution → absent in both views
07 test_single_contributor_full_share PASS
Single contributor on an item gets 100% share in per-item view.
single contributor → 100% share
08 test_two_equal_contributors_equal_split PASS
Two contributors with equal CW get 50/50 split in per-item view.
equal CW → 50/50 share split
09 test_capacity_no_affect_per_item_share PASS
Different capacities do not affect percentage share in per-item view.
capacity[P1] ≠ capacity[P2] → share% unchanged
10 test_cross_check_hours_vs_capacities PASS
Cross-check: sum of all per-item hours across all items ≤ sum of all capacities.
ΣΣ hours[P, item] ≤ Σ capacity[P]
11 test_three_contributors_weighted_split PASS
Three contributors with CW 1.0, 0.6, 0.25: shares match the weight ratio.
1.0:0.6:0.25 → 54%:32%:14% share
12 test_per_item_hours_sum_matches_js_proportion PASS
Sum of hours on an item from all persons matches the Job Size proportion in total budget.
item hours reflect JS weight in total budget
5. Edge Cases: 18 scenarios × 6 persons = 108 checks
Boundary cases and extreme scenarios that the EDPA engine must handle without crashing, with correct results and no precision loss.
01 test_person_zero_relevant_items PASS
Person with 0 relevant items must get 0 hours without crashing.
0 items → 0h, no crash
02 test_person_single_item_full_capacity PASS
Person with one item must get full capacity.
1 item → hours = capacity
03 test_all_items_same_job_size PASS
All items with the same Job Size: hours are distributed only by CW.
same JS → distribution by CW only
04 test_all_people_same_cw_on_item PASS
All people with the same CW on an item: hours are proportional to capacity.
same CW → hours proportional to capacity
05 test_job_size_zero_excluded PASS
Item with Job Size = 0 must be excluded from calculation (no division by zero).
JS = 0 → item excluded, no division by zero
06 test_single_person_team PASS
Single-person team: person gets full capacity regardless of CW.
single person → full capacity
07 test_hundred_items_capacity_sum PASS
100 items for one person: capacity must still sum correctly.
100 items: Σ hours = capacity
08 test_max_job_size_allocation PASS
Maximum Job Size (20) must produce correct proportional allocation.
JS = 20 → correct proportional allocation
09 test_min_job_size_allocation PASS
Minimum Job Size (1) must produce correct proportional allocation.
JS = 1 → correct proportional allocation
10 test_all_cw_equal_distribution PASS
All CW = 1.0: hours are distributed equally by Job Size.
all CW = 1.0 → equal distribution per JS
11 test_very_unequal_capacities PASS
Very unequal capacities (10h vs 160h): each person sums to their own capacity.
10h + 160h: each sums to own capacity
12 test_floating_point_precision PASS
Float precision: sum must be within 0.01h tolerance of capacity.
Σ within 0.01h tolerance
13 test_unicode_item_titles PASS
Unicode characters in item titles must not cause processing errors.
Unicode titles → no processing errors
14 test_empty_iteration_graceful PASS
Empty iteration (no stories) must be handled without crashing.
empty iteration → graceful handling
15 test_person_only_epic_feature PASS
Person only on Epic/Feature (no stories) must still get allocation.
Epic/Feature only → still gets allocation
16 test_negative_job_size_rejected PASS
Negative Job Size must be rejected, so that no negative allocation can occur.
JS < 0 → item rejected
17 test_duplicate_person_on_item_no_double_count PASS
Duplicate signals from the same person on an item must not double the allocation.
duplicate signals → single CW entry
18 test_large_team_scaling PASS
20+ people in a team: calculation still converges and invariants hold.
20+ people: all invariants hold
6. Auto-calibration: 12 scenarios
Verification of the auto-calibration system inspired by Karpathy's autoresearch pattern. Calibration must be safe, reproducible, and efficient.
01 test_minimum_ground_truth_records PASS
Calibration requires a minimum of 20 manually confirmed CW records.
len(ground_truth) ≥ 20 required
02 test_mad_computation_correctness PASS
MAD (Mean Absolute Deviation) is correctly computed as mean of |auto_cw - confirmed_cw|.
MAD = mean(|auto_cw - confirmed_cw|)
03 test_lower_mad_better PASS
Lower MAD = better heuristic. Optimization direction must be "lower is better".
direction: lower is better
04 test_evaluator_locked PASS
Evaluator (evaluate_cw.py) is locked and must not be modified by the optimizer.
evaluate_cw.py: LOCKED, read-only
05 test_single_change_per_iteration PASS
Each experiment changes only one parameter, which isolates the effect of each change.
one parameter change per experiment
06 test_git_commit_after_experiment PASS
A git commit is made after each experiment; the git log serves as memory.
git commit after each experiment
07 test_revert_on_worse_mad PASS
When MAD worsens, the experiment is reverted (git reset --hard HEAD~1).
MAD worse → revert (git reset --hard HEAD~1)
08 test_keep_on_better_or_equal_mad PASS
When MAD improves or stays the same, the experiment is kept.
MAD better/equal → keep commit
09 test_budget_50_experiments_max PASS
Maximum budget is 50 experiments, as protection against infinite loops.
budget ≤ 50 experiments
10 test_expected_improvement_range PASS
Expected improvement is 15–30% MAD reduction after 50 experiments.
expected: 15-30% MAD reduction
11 test_ground_truth_format_validation PASS
Ground truth records must contain: item_id, person_id, evidence_role, auto_cw, confirmed_cw.
required fields: item_id, person_id, evidence_role, auto_cw, confirmed_cw
12 test_no_data_leakage PASS
Training and validation sets are strictly separated, with no data leakage between them.
no data leakage between train/validation
7. Governance & Audit: 17 scenarios
Verification of audit trail, freeze rules, governance processes, and compliance requirements. EDPA must be fully auditable and reproducible.
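Two of the governance checks in this category are purely arithmetic: the WSJF formula and the Job Size guardrails per item type. A minimal sketch, assuming BV, TC, and RR are plain numeric scores (presumably business value, time criticality, and risk reduction, as in SAFe's WSJF); the function names and the `JS_LIMITS` table are illustrative only:

```python
# Upper Job Size limits per item type, as listed in the guardrail checks.
JS_LIMITS = {"Story": 8, "Feature": 13, "Epic": 20}


def wsjf(bv: float, tc: float, rr: float, js: float) -> float:
    """WSJF = (BV + TC + RR) / JS; a non-positive Job Size is rejected."""
    if js <= 0:
        raise ValueError("Job Size must be positive")
    return (bv + tc + rr) / js


def job_size_within_guardrail(item_type: str, js: float) -> bool:
    """True if the Job Size is positive and within the limit for its type."""
    return 0 < js <= JS_LIMITS[item_type]


print(wsjf(bv=8, tc=5, rr=3, js=4))
print(job_size_within_guardrail("Story", 13))
```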
01 test_snapshot_frozen_after_close PASS
Snapshot is frozen after Iteration Close and must not be modified afterwards.
snapshot.frozen = true after close
02 test_frozen_snapshot_immutable PASS
Frozen snapshot must not be modified in-place.
frozen snapshot: no in-place modification
03 test_corrections_create_new_revision PASS
Corrections create a new revision (_rev2, _rev3), never overwrite the original.
correction → new revision (_rev2, _rev3)
04 test_snapshot_required_fields PASS
Snapshot must contain all 10 required top-level keys.
10 required keys present in snapshot
05 test_branch_naming_enforced PASS
Branch naming convention: {type}/{ITEM-ID}-description must be enforced.
branch: {type}/{ITEM-ID}-description
06 test_pr_references_work_item PASS
PR must reference a work item (S-XXX, F-XXX, E-XXX) in title or body.
PR references: S-XXX, F-XXX, or E-XXX
07 test_traceability_chain PASS
Full traceability chain: Initiative → Epic → Feature → Story → PR → Commit.
Initiative → Epic → Feature → Story → PR → Commit
08 test_wsjf_calculation PASS
WSJF is correctly calculated as (BV + TC + RR) / JS.
WSJF = (BV + TC + RR) / JS
09 test_job_size_guardrails_story PASS
Job Size guardrails for Story: JS ≤ 8 (ideally ≤ 5).
Story JS ≤ 8 (recommended ≤ 5)
10 test_job_size_guardrails_feature PASS
Job Size guardrails for Feature: JS ≤ 13.
Feature JS ≤ 13
11 test_job_size_guardrails_epic PASS
Job Size guardrails for Epic: JS ≤ 20.
Epic JS ≤ 20
12 test_dor_checklist_validation PASS
Definition of Ready checklist: description, AC, estimate, parent linked.
DoR: description, AC, estimate, parent linked
13 test_dod_checklist_validation PASS
Definition of Done checklist: code reviewed, tests passed, PR merged.
DoD: code reviewed, tests passed, PR merged
14 test_wip_limit_enforcement PASS
WIP limit: ideally 1 Story per person at any given time.
WIP limit: 1 Story per person (ideal)
15 test_bankid_signature_support PASS
BankID electronic signature support (Act 21/2020 Coll.).
BankID signature: Act 21/2020 Coll.
16 test_reproducible_calculation PASS
Reproducible calculation: same inputs must always produce same outputs.
same inputs → same outputs (deterministic)
17 test_audit_trail_five_pillars PASS
Audit trail covers 5 pillars: GitHub evidence, capacity, snapshot, reproducible calculation, signature.
5 pillars: evidence, capacity, snapshot, calc, signature
8. Capacity Planning: 12 scenarios
Verification of the Iteration Planning Protocol — planning_factor as a team-level property, 80% rule, buffer usage tracking, and capacity commitment workflow.
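The two formulas checked in this category, Planning Capacity and Buffer_Usage, can be sketched as follows. This is an illustration only: the function names are hypothetical, and unplanned work is assumed to be measured in hours.

```python
def planning_capacity(capacities: list[float], planning_factor: float = 0.8) -> float:
    """Planning_Capacity = Σ Capacity[P] × planning_factor, with factor in (0, 1.0]."""
    if not 0 < planning_factor <= 1.0:
        raise ValueError("planning_factor must be in the range (0, 1.0]")
    return sum(capacities) * planning_factor


def buffer_usage(unplanned_hours: float, total: float, planning: float) -> float:
    """Buffer_Usage = unplanned / (Total - Planning) × 100%."""
    buffer = total - planning
    if buffer <= 0:
        raise ValueError("no buffer available (planning_factor = 1.0)")
    return unplanned_hours / buffer * 100.0


team = [40.0, 80.0, 60.0]           # capacities in hours for one team
planned = planning_capacity(team)   # default factor: plan to 80% of total
usage = buffer_usage(unplanned_hours=18.0, total=sum(team), planning=planned)
print(planned, usage, "warning" if usage > 90 else "ok")
```

Note that these values only inform planning; per the checks in this category, derived hours are still computed from total capacity.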
01 test_planning_factor_team_level PASS
planning_factor must be a team-level property, not a cadence or person-level property.
teams[].planning_factor (not cadence, not person)
02 test_planning_factor_default PASS
Default planning_factor must be 0.8 (plan to 80% of total capacity).
planning_factor default = 0.8
03 test_planning_factor_range PASS
planning_factor must be in range (0, 1.0]: never zero, never above 100%.
0 < planning_factor ≤ 1.0
04 test_planning_capacity_formula PASS
Planning Capacity = Total Capacity × planning_factor for each team.
Planning_Capacity = Σ Capacity[P] × planning_factor
05 test_different_teams_different_factors PASS
Different teams may have different planning_factor values.
teams[A].planning_factor ≠ teams[B].planning_factor allowed
06 test_edpa_uses_total_not_planning PASS
EDPA calculation always uses Total Capacity (100%), not Planning Capacity.
DerivedHours uses Capacity[P], not Planning_Capacity
07 test_buffer_absorbs_unplanned PASS
Buffer (20% by default) absorbs support, maintenance, incidents, and unplanned work.
buffer = Total - Planning → unplanned work
08 test_unplanned_items_generate_evidence PASS
Unplanned items in the buffer generate evidence and are allocated normally by EDPA.
unplanned items → evidence → normal EDPA allocation
09 test_capacity_confirmed_at_planning PASS
Each team member must confirm availability at Iteration Planning (availability: confirmed).
availability = confirmed required
10 test_planning_factor_no_affect_invariant PASS
planning_factor must not affect the mathematical guarantee Σ DerivedHours = Capacity.
planning_factor → no effect on Σ = Capacity
11 test_buffer_usage_metric PASS
Buffer_Usage metric tracks how much of the reserve was consumed by unplanned work.
Buffer_Usage = unplanned / (Total - Planning) × 100%
12 test_high_buffer_usage_warning PASS
Consistently high buffer usage (>90%) should trigger a warning to adjust capacity or scope.
Buffer_Usage > 90% → warning
Auto-calibration (Karpathy loop)
Automatic calibration system inspired by Karpathy's autoresearch pattern. One file, one metric, one loop.
| Setting | Value |
|---|---|
| Target | cw_heuristics.yaml |
| Metric | MAD (Mean Absolute Deviation) |
| Direction | lower is better |
| Budget | 50 experiments (~2h) |
| Memory | git log on calibration branch |
| Evaluator | evaluate_cw.py (LOCKED) |
- Typical improvement: 15–30% MAD reduction
- After 50 experiments: heuristic matches real team patterns
- Diminishing returns after ~30 experiments
- Prerequisite: ≥ 20 manually confirmed CW records
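The MAD metric and the 20-record prerequisite can be sketched in a few lines. This is not the locked evaluate_cw.py, only an illustration of the stated formula MAD = mean(|auto_cw - confirmed_cw|); the function name and the `min_records` parameter are assumptions.

```python
def mean_absolute_deviation(records: list[dict], min_records: int = 20) -> float:
    """MAD between automatically assigned and manually confirmed CW values.

    Each record is assumed to carry at least 'auto_cw' and 'confirmed_cw'.
    """
    if len(records) < min_records:
        raise ValueError(f"need at least {min_records} confirmed records")
    deviations = [abs(r["auto_cw"] - r["confirmed_cw"]) for r in records]
    return sum(deviations) / len(deviations)


# Hypothetical ground truth: half the records are off by 0.1, half by 0.3.
ground_truth = (
    [{"auto_cw": 0.6, "confirmed_cw": 0.5}] * 10
    + [{"auto_cw": 0.25, "confirmed_cw": 0.55}] * 10
)
print(round(mean_absolute_deviation(ground_truth), 3))
```

Lower values mean the heuristic is closer to what the team confirmed by hand, which is why the optimization direction is "lower is better".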
Loop
- git checkout -b calibration/{timestamp}
- For each experiment (1..budget):
  - Load current heuristic + experiment history
  - Propose ONE parameter change (threshold, weight, signal score)
  - git commit -m "exp {n}: {param} {old} -> {new}"
  - Run: python evaluate_cw.py --ground-truth ... --heuristics ...
  - Parse MAD from output
  - If MAD < previous_best: KEEP | Otherwise: REVERT
  - Log to calibration_log.tsv
- Print summary: initial MAD, final MAD, % improvement
- Ask user: merge calibration branch into main?
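The keep/revert logic of the loop can be sketched without git or the real evaluator. In this simplified illustration the heuristics are a flat dictionary, `evaluate` is a stand-in callable for evaluate_cw.py, and proposals are supplied by the caller; all names are hypothetical, and "revert" simply means discarding the candidate instead of resetting a commit.

```python
import copy
from typing import Callable, Iterable


def run_calibration(
    heuristics: dict[str, float],
    proposals: Iterable[tuple[str, float]],
    evaluate: Callable[[dict[str, float]], float],
    budget: int = 50,
):
    """Greedy loop: apply ONE parameter change, keep it only if MAD improves."""
    best = copy.deepcopy(heuristics)
    best_mad = evaluate(best)
    log = []
    for n, (param, new_value) in enumerate(proposals, start=1):
        if n > budget:
            break  # hard stop: protects against unbounded runs
        candidate = copy.deepcopy(best)
        old_value = candidate[param]
        candidate[param] = new_value
        mad = evaluate(candidate)
        kept = mad < best_mad  # strict improvement, as in the loop description
        if kept:
            best, best_mad = candidate, mad
        log.append((n, param, old_value, new_value, mad, "KEEP" if kept else "REVERT"))
    return best, best_mad, log


# Toy objective: pretend the confirmed value for 'key' is 0.5.
toy_evaluate = lambda h: abs(h["key"] - 0.5)
best, best_mad, log = run_calibration(
    {"owner": 1.0, "key": 0.6},
    [("key", 0.7), ("key", 0.55), ("key", 0.5)],
    toy_evaluate,
)
for row in log:
    print(*row, sep="\t")
```

Keeping the evaluator as an injected callable mirrors the separation of optimizer and objective function: the loop can change parameters but has no way to alter how they are scored.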
Safety constraints
- The optimizer must not modify evaluate_cw.py. Separation of optimizer from objective function.
Escalation strategy
| Experiments | Focus | Parameters |
|---|---|---|
| 1–10 | role_weights | 4 parameters (highest impact) |
| 11–25 | signal weights | 6 parameters |
| 26–50 | threshold + fine-tuning | combined tuning |
CW Heuristic
Default weights for automatic Contribution Weight assignment based on GitHub signals.
| Role | CW |
|---|---|
| owner | 1.0 |
| key | 0.6 |
| reviewer | 0.25 |
| consulted | 0.15 |
| Signal | Score |
|---|---|
| assignee | 4.0 |
| contribute_command | 3.0 |
| pr_author | 2.0 |
| commit_author | 1.0 |
| pr_reviewer | 1.0 |
| issue_comment | 0.5 |
Validated by Monte Carlo simulation (1,000 scenarios, 68,156 records, p<0.001).
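The two tables above can be combined into a small lookup, sketched here under stated assumptions. The signal-to-role mapping follows the Evidence Detection checks; treating the Evidence Score as the sum of scores of distinct signals is an assumption, as are the function names and the `max_es` parameter.

```python
ROLE_WEIGHTS = {"owner": 1.0, "key": 0.6, "reviewer": 0.25, "consulted": 0.15}
SIGNAL_SCORES = {
    "assignee": 4.0, "contribute_command": 3.0, "pr_author": 2.0,
    "commit_author": 1.0, "pr_reviewer": 1.0, "issue_comment": 0.5,
}
# Role implied by each signal, as described in the Evidence Detection checks.
SIGNAL_ROLES = {
    "assignee": "owner", "contribute_command": "key", "pr_author": "key",
    "commit_author": "reviewer", "pr_reviewer": "reviewer", "issue_comment": "consulted",
}


def contribution_weight(signals, manual_cw=None):
    """Strongest signal wins; a manual /contribute weight takes precedence."""
    if manual_cw is not None:
        return manual_cw
    if not signals:
        return None  # no signals: the person is not assigned to the item
    return max(ROLE_WEIGHTS[SIGNAL_ROLES[s]] for s in set(signals))


def evidence_score(signals):
    """Assumed: ES is the sum of scores over distinct signals (duplicates ignored)."""
    return sum(SIGNAL_SCORES[s] for s in set(signals))


def relevance_signal(es, max_es):
    """RS = min(ES / max_ES, 1.0)."""
    return min(es / max_es, 1.0)


signals = ["assignee", "commit_author", "commit_author"]
print(contribution_weight(signals), evidence_score(signals))
```

Using `set(signals)` reflects two stated rules at once: commit count does not matter, and duplicate signals from the same person must not double-count.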
Git measures activity, not value. Strategic roles (BO, PM, Arch) are systematically undervalued.
| Role | Bias |
|---|---|
| Business Owner | +0.15 |
| Product Manager | +0.05 |
| Architect | +0.05 |
| Developer | 0.00 |
Method comparison
| Criterion | EDPA v1.0.0 | Manual timesheets | Fixed allocation |
|---|---|---|---|
| Accuracy | High | Medium | Low |
| Effort | Minimal | High | None |
| Auditability | Full | Partial | None |
| Dual-view | Yes | No | No |
| Math. guarantee | Σ = capacity | None | Complex |
| Automation | GitHub Actions | Manual | Partial |
Demo calculation
Static demonstration of EDPA calculation for 3 people and 5 work items. Operational variant (Simple mode).
Capacity
| Person | FTE | Capacity (h) |
|---|---|---|
| Alice | 0.5 | 40h |
| Bob | 1.0 | 80h |
| Carol | 0.75 | 60h |
Work items & assignments
| Item | JS | Alice (CW) | Bob (CW) | Carol (CW) |
|---|---|---|---|---|
| S-101 | 5 | 1.0 (owner) | 0.25 (reviewer) | — |
| S-102 | 3 | 0.6 (key) | 1.0 (owner) | — |
| S-103 | 8 | — | 1.0 (owner) | 0.6 (key) |
| S-104 | 2 | — | 0.25 (reviewer) | 1.0 (owner) |
| S-105 | 5 | — | — | 1.0 (owner) |
Score calculation (Score = JS × CW)
| Item | JS | Alice Score | Bob Score | Carol Score |
|---|---|---|---|---|
| S-101 | 5 | 5.0 | 1.25 | — |
| S-102 | 3 | 1.8 | 3.0 | — |
| S-103 | 8 | — | 8.0 | 4.8 |
| S-104 | 2 | — | 0.5 | 2.0 |
| S-105 | 5 | — | — | 5.0 |
| Σ | | 6.8 | 12.75 | 11.8 |
Derived Hours (DH = Score / ΣScore × Capacity)
| Item | Alice (40h) | Bob (80h) | Carol (60h) |
|---|---|---|---|
| S-101 | 29.41h | 7.84h | — |
| S-102 | 10.59h | 18.82h | — |
| S-103 | — | 50.20h | 24.41h |
| S-104 | — | 3.14h | 10.17h |
| S-105 | — | — | 25.42h |
| Σ | 40.00h | 80.00h | 60.00h |
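The demo tables can be reproduced with a short script. The data below is copied from the capacity and assignment tables; the function name `derived_hours` is illustrative, and the calculation follows the simple-mode formulas Score = JS × CW and DH = Score / ΣScore × Capacity.

```python
CAPACITY = {"Alice": 40.0, "Bob": 80.0, "Carol": 60.0}  # hours
JOB_SIZE = {"S-101": 5, "S-102": 3, "S-103": 8, "S-104": 2, "S-105": 5}
CW = {
    "Alice": {"S-101": 1.0, "S-102": 0.6},
    "Bob": {"S-101": 0.25, "S-102": 1.0, "S-103": 1.0, "S-104": 0.25},
    "Carol": {"S-103": 0.6, "S-104": 1.0, "S-105": 1.0},
}


def derived_hours(person: str) -> dict[str, float]:
    """Simple mode: split a person's capacity across items by JS × CW."""
    scores = {item: JOB_SIZE[item] * cw for item, cw in CW[person].items()}
    total = sum(scores.values())
    if total == 0:
        return {}  # no scored items: nothing to allocate
    return {item: score / total * CAPACITY[person] for item, score in scores.items()}


for person, capacity in CAPACITY.items():
    hours = derived_hours(person)
    # Invariant from the test suite: the sum matches capacity within 0.01h.
    assert abs(sum(hours.values()) - capacity) < 0.01
    print(person, {item: round(h, 2) for item, h in hours.items()})
```

Because each person's scores are normalized by that person's own score total, the hours always sum to the declared capacity regardless of how many items or which weights are involved.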