Eval gold set · harness v0.2.0 · English
How Preclari measures whether the server's regulatory preflight is actually correct — not just well-formed. This page walks the framework, the 12-dimension rubric, and the two expert-authored cases that make up the current gold set.
What we're asking you to own
You're the owner of two questions: is the system actually correct, and how do we know. That means running the eval loop end-to-end and growing what it covers — not just scoring what exists today.
When a regulatory SME joins, they take over gold-reference authorship — the domain judgment of what's correct. You then go full-time on the eval system itself: harness rigour, the LLM-judge consensus automation, coverage breadth, and regression as the corpus and product move under you.
First tasks
1. Run both cases below to learn the harness hands-on.
2. Reconcile the gold references against the corrected corpus. Start with the Swissmedic-CS-Note citation in Case 1, which we've since established is a phantom (the real reference is PIC/S PI 011-3).
3. Propose the next 3–5 workflows to grow coverage across the tiers.
PIF conformance and the eval rubric answer different questions. Neither substitutes for the other; a release needs both.
PIF conformance asks: is this output a valid PIF document? It is a schema check against PreflightAssertion: deterministic, fast, automatable. It runs as a hard gate before any grading; if the assertion doesn't validate, the run halts (exit code 2) and nothing gets graded.
The eval rubric asks: is the content correct against expert judgment? It is scored across 12 dimensions, is partly subjective and slower, and needs a gold reference and a grader. A PIF-valid output can still name the wrong regulations; a substance-correct output can still be malformed. The two checks are independent.
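The gate-before-grading order can be sketched in a few lines of Python. This is an illustration, not the harness's code: `validate_pif`, `REQUIRED_FIELDS`, and the `risk_level` key are hypothetical stand-ins for the real PreflightAssertion schema. Only exit codes 0 and 2 come from this page.

```python
from typing import Callable, Dict, List, Optional, Tuple

EXIT_OK = 0          # success
EXIT_PIF_FAILED = 2  # PIF conformance failed

# Hypothetical stand-in for the PreflightAssertion schema.
REQUIRED_FIELDS = ("applicable_requirements", "risk_level")


def validate_pif(assertion: Dict) -> List[str]:
    """Return a list of violations; an empty list means the assertion is valid."""
    return [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in assertion]


def gate_then_grade(
    assertion: Dict, grade: Callable[[Dict], Dict]
) -> Tuple[int, Optional[Dict]]:
    """Run the conformance gate first; grade only if it passes."""
    violations = validate_pif(assertion)
    if violations:
        # Halt: nothing is graded when the assertion is malformed.
        return EXIT_PIF_FAILED, None
    return EXIT_OK, grade(assertion)
```

The grader callback is never invoked on a malformed assertion, which is what makes the gate "hard" rather than advisory.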
Each run is written twice: outputs-raw (model identity intact) and outputs-blinded (the produced_by model fields stripped). The grader opens only the blinded copy, so writing style can't bias the score. Reconciliation against raw happens only after all scores are committed.
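A minimal sketch of the blinding step, assuming `produced_by` is an object whose model-identifying keys start with `model`. That key layout is an assumption made for illustration; the page only says the model fields are stripped.

```python
import copy
from typing import Dict


def blind(output: Dict) -> Dict:
    """Return a copy of a run output with model identity removed.

    Assumption: model identity lives in keys of `produced_by` whose
    names start with "model". The raw output is left untouched so it
    can be used for reconciliation after scores are committed.
    """
    blinded = copy.deepcopy(output)
    produced_by = blinded.get("produced_by")
    if isinstance(produced_by, dict):
        for key in list(produced_by):
            if key.startswith("model"):
                del produced_by[key]
    return blinded
```

Working on a deep copy keeps the raw and blinded versions independent, so grading the blinded copy cannot alter the record used for reconciliation.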
Score types: ratio 0.0–1.0 (recall/precision) · anchored 1–5 (reasoning, questions, scope, verification) · directional 0–3 (risk — asymmetric, never mean-averaged) · binary pass/fail (consistency, reproducibility). The aggregator never collapses to a composite — a release blocks if any gated dimension misses its threshold (Gate 5.5).
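The "no composite, block on any gated miss" rule could look like the sketch below. The dimension names and threshold values are hypothetical; the real ones live in thresholds.yaml and are not reproduced on this page.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Gate:
    dimension: str   # hypothetical dimension key
    minimum: float   # hypothetical threshold; pass/fail can be encoded as 1/0


def blocked_dimensions(scores: Dict[str, float], gates: List[Gate]) -> List[str]:
    """List every gated dimension that misses its threshold.

    A dimension with no score counts as a miss. No average or weighted
    sum is computed anywhere: each gate is checked on its own.
    """
    return [
        g.dimension
        for g in gates
        if scores.get(g.dimension, float("-inf")) < g.minimum
    ]


def release_allowed(scores: Dict[str, float], gates: List[Gate]) -> bool:
    """A release is allowed only when no gated dimension is blocked."""
    return not blocked_dimensions(scores, gates)
```

Returning the list of blocked dimensions, rather than a single number, keeps the report readable dimension by dimension.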
Each output is graded one dimension at a time against the gold reference. A score below 3 on any of the 1–5 dimensions requires a written one-sentence reason, which forces the grader to actually examine the output.
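One way to enforce the written-reason rule before a grades file is accepted is a small check like the one below. The dimension keys and the `score` / `reason` field names are hypothetical; the actual layout of grades.yaml is not shown on this page.

```python
from typing import Dict, List

# Hypothetical keys for the four anchored 1-5 dimensions.
ANCHORED_DIMENSIONS = frozenset({
    "applicability_reasoning",
    "clarifying_questions",
    "out_of_scope_completeness",
    "verification_actionability",
})


def missing_reasons(grades: Dict[str, Dict]) -> List[str]:
    """Return anchored dimensions scored below 3 that lack a written reason."""
    problems = []
    for dimension, entry in grades.items():
        if dimension not in ANCHORED_DIMENSIONS:
            continue
        reason = (entry.get("reason") or "").strip()
        if entry["score"] < 3 and not reason:
            problems.append(dimension)
    return sorted(problems)
```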
| # | Dimension | Type | What it measures |
|---|---|---|---|
| 1 | Requirements coverage (recall) | ratio | Of the gold's applicable requirements, how many did the candidate find? Missing a requirement is the most dangerous failure — invisible to a reviewer. |
| 2 | Requirements precision | ratio | Of the requirements the candidate named, how many are actually valid? Over-citing (e.g. 21 CFR Part 11 with no US scope) costs here. |
| 3 | Applicability reasoning | 1–5 | Is the applicability_basis chain sound? Score the reasoning, not the conclusion — right answer / bad chain is a 1–2. |
| 4 | Missing-controls coverage | ratio | Did it find the control gaps the gold lists? Semantic match, not string match. |
| 5 | Missing-controls precision | ratio | Are the gaps it flagged real? Common miss: flagging a control that context_notes says is already present. |
| 6 | Risk classification | 0–3 | Exact = 3; conservative miss = 2; permissive miss = 1 (worse — leaks risk); off by ≥2 or missing = 0. |
| 7 | Assumption surfacing | ratio | Did it surface the material assumptions (the ones whose being wrong changes the answer)? Usually the weakest dimension. |
| 8 | Clarifying questions | 1–5 | Would the questions actually change the assertion if answered? Zero questions on a genuinely ambiguous workflow = 1. |
| 9 | Out-of-scope completeness | 1–5 | Did it declare what it deliberately did not consider, with a clear "why not"? |
| 10 | Verification actionability | 1–5 | Could a reviewer actually run the verification steps and reach a yes/no? Specific source + anchor = 5. |
| 11 | Internal consistency | pass/fail | Does the human Brief say the same thing as the structured JSON (counts, risk level, cited regs)? Any mismatch = fail. |
| 12 | Reproducibility | pass/fail | Same input + same corpus snapshot → same requirements, controls, risk on a re-run. Scored for the system, not per-workflow. |
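The directional risk score (dimension 6) is the least intuitive of the four score types, so here is a sketch of it. It assumes a three-level scale of low / medium / high and reads "conservative" as one level above the gold and "permissive" as one level below. Both are assumptions; the rubric's own anchors are in rubric.md.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Assumed ordering of risk levels, lowest to highest.
LEVELS = ("low", "medium", "high")


def risk_score(candidate: Optional[str], gold: str) -> int:
    """Score a risk classification on the asymmetric 0-3 scale."""
    if candidate not in LEVELS:
        return 0  # missing or unrecognised level
    diff = LEVELS.index(candidate) - LEVELS.index(gold)
    if diff == 0:
        return 3  # exact
    if diff == 1:
        return 2  # conservative miss: rated one level too high
    if diff == -1:
        return 1  # permissive miss: rated one level too low, leaks risk
    return 0      # off by two or more


def risk_distribution(scores: Iterable[int]) -> Dict[int, int]:
    """Report risk scores as counts per value, never as a mean."""
    return dict(Counter(scores))
```

Reporting counts per score keeps a single permissive miss visible; a mean would let several exact matches hide it.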
A single workflow isn't tested once. The same gold reference is the fixed answer key while three things vary: which model produced the output, which language it ran in, and whether the result holds on a repeat run. That's the point of the harness — it's a comparison engine, not a one-shot check.
The harness drives the same workflow through different model configurations — recorded in each run's model_config (a reasoning model + an extractive model, e.g. claude-opus-4-7 + claude-haiku, or the gemini-2.5-flash config used in the latest run). Every output is graded blind, so the per-dimension breakdown becomes an apples-to-apples comparison. The eval's real job: which model config is good enough to ship, dimension by dimension — never a single leaderboard number.
The set is English-only today, but it is locale-namespaced from day one (inputs/en/, gold-references/en/), so de / fr / it are additive, never a restructure. There are two distinct things to test as languages land: cross-language eval (grade the same workflow in each language) and translation-quality regression (does the German rendering of an English assertion preserve the substance?). Runs stratify as runs/<ts>/<locale>/ and reports/<locale>/.
Repeat-run reproducibility
Every workflow is run twice on the same corpus snapshot (dimension 12). "Same" means the same requirements, controls, and risk; wording may vary. The run pins corpus.snapshot_sha (the corpus tree SHA, not the repo head), so a corpus edit correctly invalidates a run for reproducibility while a docs-only edit doesn't. This catches unseeded sampling, cache effects, and silent model drift.
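A sketch of the repeat-run comparison, under the assumption that each run record exposes `requirements`, `controls`, and `risk` alongside the pinned `corpus.snapshot_sha`. The first three key names are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple


def substance(run: Dict) -> Tuple[FrozenSet[str], FrozenSet[str], str]:
    """Reduce a run to what must match: requirement IDs, control IDs, risk.

    Order and wording are deliberately ignored.
    """
    return (
        frozenset(run["requirements"]),
        frozenset(run["controls"]),
        run["risk"],
    )


def reproducible(first: Dict, second: Dict) -> bool:
    """Dimension 12: pass only if both runs agree on substance.

    Runs pinned to different corpus snapshots are not comparable,
    so that case raises instead of returning a verdict.
    """
    if first["corpus"]["snapshot_sha"] != second["corpus"]["snapshot_sha"]:
        raise ValueError("runs pin different corpus snapshots")
    return substance(first) == substance(second)
```

Comparing sets of IDs rather than rendered text is what lets wording vary between runs without failing the dimension.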
Case 1 (wf_qd_triage_2026_001) is a familiar, well-controlled manufacturing case: the one the team has thought hardest about, so it carries the highest grading confidence. The expert challenge is to treat Switzerland as in-scope (Swissmedic aligns with PIC/S Annex 11) and to catch change control for model updates as a gap.
WorkflowDescription fed to the MCP server:

    {
      "intent": "Classify incoming quality deviations by severity, suggest root-cause categories from historical patterns, and draft an initial investigation plan that a human quality engineer reviews and approves.",
      "ai_role": "recommendation",
      "data_classes": ["gxp_record", "quality_data", "manufacturing_data"],
      "jurisdictions": ["EU", "CH"],
      "human_gate": "approve_each",
      "reversibility": "reversible",
      "lifecycle_stage": "pilot",
      "risk_tolerance": "low",
      "gxp_domains_self_declared": ["GMP", "quality_systems", "data_integrity"],
      "context_notes": "Pilot: 20 deviations/month, single tablet line, Basel facility. Controls: trained QE reviewer, validated QMS, quarterly trend analysis. LLM output is a structured suggestion only; no action without QE sign-off."
    }
Risk classification: MEDIUM — GxP touchpoint, but strong controls (approve-each), narrow pilot, reversible. Low is defensible; high is not at this stage.
| Requirement | Source | Strength | Basis |
|---|---|---|---|
| EU-GMP-Annex-11-§4 | EU GMP | Strong | Computerized system influencing GxP decisions — validation expectations apply regardless of the human gate. |
| ICH-Q9-5.1 | ICH | Strong | Risk-management requirements for the quality system cover AI-assisted deviation classification. |
| MHRA-DI-2018-§6.6 | UK MHRA | Strong | ALCOA+ data-integrity expectations for records influenced by computerized systems. |
| Swissmedic-CS-Note-2023 | Swissmedic | Strong | Swiss facility (Basel); Swissmedic aligns with PIC/S Annex 11. Do not treat CH as out-of-scope. |
| EU-GMP-Annex-15 | EU GMP | Moderate | Qualification/validation of the AI system as a computerized system in GMP scope. |
Precision traps (must NOT appear): 21 CFR Part 11 (no US scope) · GDPR (no PII/PHI) · ICH-Q10 (PQS-level, not this workflow).
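As a worked example of dimensions 1 and 2 on this case, the sketch below scores a hypothetical candidate that finds four of the five gold requirements and also cites 21 CFR Part 11. The gold IDs are copied from the table as it currently stands (including the Swissmedic entry flagged for reconciliation); the trap IDs and the candidate are invented for illustration. Precision here treats the gold list as the full set of valid requirements, which is a simplification.

```python
from typing import Set

GOLD = {
    "EU-GMP-Annex-11-§4",
    "ICH-Q9-5.1",
    "MHRA-DI-2018-§6.6",
    "Swissmedic-CS-Note-2023",
    "EU-GMP-Annex-15",
}
# Hypothetical IDs for the three precision traps.
TRAPS = {"21-CFR-Part-11", "GDPR", "ICH-Q10"}


def recall(candidate: Set[str], gold: Set[str]) -> float:
    """Share of gold requirements the candidate found."""
    return len(candidate & gold) / len(gold) if gold else 1.0


def precision(candidate: Set[str], gold: Set[str]) -> float:
    """Share of candidate requirements that are in the gold set.

    An empty candidate scores 0.0 here; that convention is an assumption.
    """
    return len(candidate & gold) / len(candidate) if candidate else 0.0


# Hypothetical candidate: misses the Swissmedic entry, over-cites Part 11.
candidate = (GOLD - {"Swissmedic-CS-Note-2023"}) | {"21-CFR-Part-11"}

print(recall(candidate, GOLD))     # 4 of 5 gold requirements found
print(precision(candidate, GOLD))  # 4 of 5 cited requirements valid
print(candidate & TRAPS)           # traps the candidate fell into
```

Both ratios come out at 4/5 here, but for different reasons: recall drops because Switzerland was missed, precision because of the US over-citation.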
| Control gap | Criticality | Why |
|---|---|---|
| Documented URS | Required | Annex 11 §4 expects a URS for GxP-impacting computerized systems; none indicated. |
| Risk-based qualification plan | Required | Expected before pilot; not described. |
| AI output attribution in record | Required | ALCOA+ requires distinguishing AI-generated from human-authored content. |
| Audit trail of AI invocations | Required | Prompts/inputs/outputs should be logged for audit and trend review. |
| Change control for model updates | Recommended | A provider model update changes a GxP-impacting system. Models often miss this — sparse training data. |
| Periodic performance review | Recommended | AI suggestion quality should be reviewed against human decisions to detect drift. |
Case 2 (wf_pms_triage_2026_002) is the deliberately hard case. It carries latent privacy and SaMD-classification traps and uses the v0.2 regulatory_domains_self_declared CURIEs. It tests whether the system spots the GDPR trap hidden in "customer complaints" and routes it correctly, and whether it escalates risk to high because a false negative here means a missed adverse event.
PIF v0.2 WorkflowDescription (note the CURIE namespaces):

    {
      "intent": "Continuously ingest customer complaints, service logs, and public adverse-event databases for our Class IIa infusion pumps. Flag potential new safety signals and draft a weekly trend summary for the PRRC to review.",
      "ai_role": "draft",
      "data_classes": ["safety_data", "quality_data", "clinical_data"],
      "jurisdictions": ["EU", "CH"],
      "output_destination": "advisory",
      "human_gate": "approve_batch",
      "lifecycle_stage": "design",
      "regulatory_domains_self_declared": [
        "eu_mdr:article_83_87_post_market_surveillance_and_vigilance",
        "iso:13485_quality_management_system",
        "ch_swissmedic:mepv_medical_devices_ordinance"
      ],
      "context_notes": "Internal regulatory-affairs use at our Swiss HQ. Does not make diagnostic decisions. PRRC reviews the weekly batched summary before logging any new safety signal in the QMS."
    }
Risk classification: HIGH — the LLM is a filter; a false negative drops a real safety signal the PRRC then never sees → failure to report under MDR Art. 87. The batch gate is a buffer, not a cure. Medium is only conditionally defensible and shows weaker MDR-vigilance understanding.
| Requirement | Source | Strength | Basis |
|---|---|---|---|
| EU-MDR-Art-83 | EU MDR | Strong | Automates collection/review of post-market data — squarely the PMS-system requirement. |
| EU-MDR-Art-87 | EU MDR | Strong | Monitoring adverse-event databases; failure to escalate signals hits the incident-reporting obligation. |
| EU-MDR-PRRC | EU MDR | Strong | PRRC is the named human-in-the-loop; Art. 15 obligations apply to their PMS role. |
| ISO-13485-8.2 | ISO 13485 | Strong | Clause 8.2 mandates a feedback system for early warning of quality problems. |
| Swissmedic-MepV | Swissmedic | Strong | MedDO/MepV applies for Swiss-market PMS and vigilance, mirroring MDR. |
Precision traps (must NOT appear as direct requirements): EU AI Act high-risk (the tool is internal QMS search/summarization, not a safety component of the device — category error) · SaMD / MDR Annex VIII Rule 11 (no per-patient diagnostic/therapeutic info).
| Control gap | Criticality | Why |
|---|---|---|
| gxp:ai_output_attribution_in_record | Required | The drafted summary must be marked AI-generated before the PRRC signs off. |
| iso27001:A.9.4.1 (access) | Required | Complaints/logs often carry PHI/PII — access controls + minimization before data hits the LLM. |
| Recall/precision validation plan | Required | The AI is a filter — a plan must define the acceptable false-negative rate for dropping complaints. |
| gxp:periodic_performance_review | Required | Signal-detection logic must be audited against a human baseline to catch drift. |
… applicable_requirements and INTO a conditional recommendation?

… eu_mdr: and iso: namespaces?

Capstone run 2026-05-25 (rev 4, post-rebase on main + medtech corpus from PR #35). Harness 0.2.0 · corpus snapshot of 16 sources · run config Gemini 2.5 Flash.
| Workflow | PIF conformance | Swissmedic retrieved | behavioral.gdpr-routing | Verdict |
|---|---|---|---|---|
| wf_pms_triage_2026_002 (hard, v0.2) | PASS | Yes — 4 MepV requirements | fail (corpus-blocked, correct) | COMPLETE PASS |
| wf_qd_triage_2026_001 (familiar, v0.1) | PASS | N/A | n/a | CLEAN PASS |
Both pass PIF schema validation with zero pif_violations. Swissmedic-MepV retrieves end-to-end (corpus and CURIE input are both present after #35). behavioral.gdpr-routing fails as designed: the corpus has no GDPR source, so the check is corpus-blocked and fail is the correct outcome. The pharma regression is clean.
    # 1: produce a run (drives the server, gates on PIF)
    npm run eval -- --workflow wf_qd_triage_2026_001

    # 2: grade blind, then aggregate to a report
    cp evals/harness/_grades-template.yaml \
       evals/runs/<run-id>/grades.yaml
    # …fill scores against outputs-blinded…
    npm run eval:report -- --run <run-id>
Exit codes: 0 success · 1 runtime error · 2 PIF conformance failed (manifest captures the violation list).
| Path | Contents |
|---|---|
| rubric.md | The 12 dimensions + anchors |
| thresholds.yaml | Gate 5.5 launch thresholds |
| gold-references/en/ | The answer keys (locale-namespaced) |
| inputs/en/ | WorkflowDescription JSONs |
| runs/ | Timestamped outputs (gitignored) |
| reports/ | Committed run verdicts |
| harness/ | Runner code (Zod-validated manifests) |

Eval-set composition (target)
~60% familiar workflows (highest grading confidence), ~25% adjacent domains (generalization), ~15% intentionally hard cases (honesty / refusal). Track each workflow's category so scores disaggregate: "85% familiar / 60% adjacent / 40% hard" tells you far more than one "75% overall".
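Disaggregating by category is a small grouping step. The sketch below assumes each workflow result carries a `category` tag and a boolean `passed` verdict; both field names and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def pass_rate_by_category(results: Iterable[Dict]) -> Dict[str, float]:
    """Group workflow verdicts by category and return a pass rate for each."""
    buckets: Dict[str, List[bool]] = defaultdict(list)
    for result in results:
        buckets[result["category"]].append(bool(result["passed"]))
    return {
        category: sum(verdicts) / len(verdicts)
        for category, verdicts in buckets.items()
    }


# Hypothetical results, not real run data.
sample = [
    {"category": "familiar", "passed": True},
    {"category": "familiar", "passed": True},
    {"category": "familiar", "passed": True},
    {"category": "familiar", "passed": False},
    {"category": "adjacent", "passed": True},
    {"category": "adjacent", "passed": False},
    {"category": "hard", "passed": False},
]
print(pass_rate_by_category(sample))
```

The pooled pass rate for this sample is 4 of 7, which on its own says nothing about the hard category failing outright.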