Preclari Evals

Eval gold set · harness v0.2.0 · English

The two gold-set cases, end‑to‑end

How Preclari measures whether the server's regulatory preflight is actually correct, not just well-formed. This page walks through the framework, the 12-dimension rubric, and the two expert-authored cases that make up the current gold set.

Ani — start here. Two ideas do all the work. (1) Every output is checked on two independent layers: is it valid PIF (structure), and is it right against an expert gold reference (substance). (2) Substance is scored blind, one dimension at a time, never collapsed into a single number — the per-dimension breakdown is the whole point. Read Case 1 first (the familiar one), then Case 2 (the hard one with the privacy trap). The gold references are the answer keys you'll be grading against.

What we're asking you to own

Ani — take the lead on evals & coverage

You're the owner of two questions: is the system actually correct, and how do we know. That means running the eval loop end-to-end and growing what it covers — not just scoring what exists today.

Own now — lead

  • The loop: run → grade blind → aggregate → report, against the rubric below.
  • Coverage: grow the gold set with breadth and rigour — toward 15–20 workflows at 60% familiar / 25% adjacent / 15% hard, every one categorized so scores disaggregate.
  • Integrity: keep gold references reconciled with the live corpus as it changes; track each dimension's score over time to see what's getting better or worse.

As the team grows

When a regulatory SME joins, they take over gold-reference authorship — the domain judgment of what's correct. You then go full-time on the eval system itself: harness rigour, the LLM-judge consensus automation, coverage breadth, and regression as the corpus and product move under you.

First tasks

  1. Run both cases below to learn the harness hands-on.
  2. Reconcile the gold references against the corrected corpus. Start with the Swissmedic-CS-Note citation in Case 1, which we've since established is a phantom (the real reference is PIC/S PI 011-3).
  3. Propose the next 3–5 workflows to grow coverage across the tiers.

01 Two layers: both must pass

PIF conformance and the eval rubric answer different questions. Neither substitutes for the other; a release needs both.

PIF conformance — structure

Is this output a valid PIF document? A schema check against PreflightAssertion. Deterministic, fast, automatable. Runs as a hard gate before any grading — if the assertion doesn't validate, the run halts (exit code 2) and nothing gets graded.
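
The ordering described above (validate first, grade only if valid) can be sketched roughly as follows. This is a minimal illustration, not the harness code: `validatePif`, `runWithGate`, and the two required keys are hypothetical stand-ins, because the real PreflightAssertion schema is not shown on this page.

```typescript
// Hypothetical shapes: the real PreflightAssertion schema is not shown on this page.
type PifViolation = { path: string; message: string };
type GateOutcome = { exitCode: 0 | 2; graded: boolean; pifViolations: PifViolation[] };

// Placeholder structural check standing in for the real schema validation.
// The required keys below are assumptions chosen for illustration only.
function validatePif(candidate: unknown): PifViolation[] {
  if (typeof candidate !== "object" || candidate === null) {
    return [{ path: "$", message: "assertion must be an object" }];
  }
  const doc = candidate as Record<string, unknown>;
  const violations: PifViolation[] = [];
  for (const key of ["applicable_requirements", "risk_classification"]) {
    if (!(key in doc)) {
      violations.push({ path: "$." + key, message: "missing required field" });
    }
  }
  return violations;
}

// Structure is checked first; the grade callback only runs when nothing is violated.
function runWithGate(candidate: unknown, grade: () => void): GateOutcome {
  const pifViolations = validatePif(candidate);
  if (pifViolations.length > 0) {
    return { exitCode: 2, graded: false, pifViolations };
  }
  grade();
  return { exitCode: 0, graded: true, pifViolations };
}
```

The point of the shape is that a conformance failure returns before `grade` is ever called, so a malformed output can never pick up substance scores.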

Eval rubric — substance

Is the content correct against expert judgment? Scored across 12 dimensions, partly subjective, slower, needs a gold reference and a grader. A PIF-valid output can still name the wrong regulations; a substance-correct output can still be malformed. Independent checks.

How grading works

Blind by construction

Each run is written twice: outputs-raw (model identity intact) and outputs-blinded (the produced_by model fields stripped). The grader opens only the blinded copy, so writing style can't bias the score. Reconciliation against raw happens only after all scores are committed.
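
As a rough sketch of the blinding step, the function below copies an output and drops every `produced_by` key. Only the field name comes from the text above; where that field sits in the real output, and the sample object, are assumptions.

```typescript
// Recursively copy a run output, dropping every `produced_by` key so the
// blinded copy carries no model identity. The nesting is an assumption.
function blindOutput(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => blindOutput(item));
  }
  if (typeof value === "object" && value !== null) {
    const original = value as Record<string, unknown>;
    const blinded: Record<string, unknown> = {};
    for (const key of Object.keys(original)) {
      if (key !== "produced_by") {
        blinded[key] = blindOutput(original[key]);
      }
    }
    return blinded;
  }
  return value;
}

// Hypothetical raw output, for illustration only.
const rawOutput = {
  produced_by: { model: "example-model" },
  risk: "medium",
  requirements: [{ id: "EU-GMP-Annex-11-§4", produced_by: { model: "example-model" } }],
};

// The raw object is left untouched, so it can still be used for reconciliation later.
console.log(JSON.stringify(blindOutput(rawOutput)));
```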

Hybrid grading

  • Human — applicability reasoning, risk, clarifying Qs, out-of-scope, verification.
  • Deterministic — internal consistency, reproducibility (field equality).
  • LLM-as-judge + consensus — coverage/precision dims: Claude Opus and Gemini Pro match-lists are intersected; disagreements escalate.
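
The consensus rule in the last bullet can be illustrated with a small sketch. It assumes each judge returns a list of matched requirement IDs with no duplicates; `judgeConsensus` and the two sample lists are hypothetical.

```typescript
type ConsensusResult = { agreed: string[]; escalated: string[] };

// Keep only the matches both judges reported. Anything reported by just one
// judge is escalated rather than silently accepted or dropped.
function judgeConsensus(judgeA: string[], judgeB: string[]): ConsensusResult {
  const agreed = judgeA.filter((id) => judgeB.indexOf(id) !== -1);
  const onlyA = judgeA.filter((id) => judgeB.indexOf(id) === -1);
  const onlyB = judgeB.filter((id) => judgeA.indexOf(id) === -1);
  return { agreed, escalated: onlyA.concat(onlyB) };
}

// Hypothetical match-lists from the two judges.
const firstJudgeMatches = ["EU-GMP-Annex-11-§4", "ICH-Q9-5.1", "EU-GMP-Annex-15"];
const secondJudgeMatches = ["EU-GMP-Annex-11-§4", "ICH-Q9-5.1", "MHRA-DI-2018-§6.6"];
const consensusDemo = judgeConsensus(firstJudgeMatches, secondJudgeMatches);
console.log(consensusDemo.agreed, consensusDemo.escalated);
```

Intersecting is deliberately strict: a match only counts toward the ratio when both judges found it independently.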

Score types: ratio 0.0–1.0 (recall/precision) · anchored 1–5 (reasoning, questions, scope, verification) · directional 0–3 (risk — asymmetric, never mean-averaged) · binary pass/fail (consistency, reproducibility). The aggregator never collapses to a composite — a release blocks if any gated dimension misses its threshold (Gate 5.5).
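
To make the "no composite" rule concrete, here is a minimal sketch of a per-dimension gate. The dimension keys and threshold values are invented for illustration (the real ones live in thresholds.yaml), and treating a missing score as a miss is an assumption of this sketch.

```typescript
type DimensionScores = Record<string, number>;

// Hypothetical thresholds; the real values live in thresholds.yaml.
const exampleThresholds: DimensionScores = {
  requirements_recall: 0.9,
  requirements_precision: 0.8,
  applicability_reasoning: 4,
};

// Return every gated dimension that misses its threshold.
// A missing score counts as a miss (assumption). No composite is computed.
function failingDimensions(scores: DimensionScores, thresholds: DimensionScores): string[] {
  return Object.keys(thresholds).filter((dim) => {
    const score = scores[dim];
    return score === undefined || score < thresholds[dim];
  });
}

// The release blocks as soon as any single gated dimension fails.
function releaseBlocked(scores: DimensionScores, thresholds: DimensionScores): boolean {
  return failingDimensions(scores, thresholds).length > 0;
}
```

With these example thresholds, a run scoring 1.0 on recall and 0.95 on precision but 3 on applicability reasoning is still blocked, which is exactly the case an averaged score would hide.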

02 The 12-dimension rubric (v0.1)

Each output is graded one dimension at a time against the gold reference. A score below 3 (on the 1–5 dims) requires a written one-sentence reason — it forces real examination.

| # | Dimension | Type | What it measures |
|---|---|---|---|
| 1 | Requirements coverage (recall) | ratio | Of the gold's applicable requirements, how many did the candidate find? Missing a requirement is the most dangerous failure: it is invisible to a reviewer. |
| 2 | Requirements precision | ratio | Of the requirements the candidate named, how many are actually valid? Over-citing (e.g. 21 CFR Part 11 with no US scope) costs here. |
| 3 | Applicability reasoning | 1–5 | Is the applicability_basis chain sound? Score the reasoning, not the conclusion: right answer / bad chain is a 1–2. |
| 4 | Missing-controls coverage | ratio | Did it find the control gaps the gold lists? Semantic match, not string match. |
| 5 | Missing-controls precision | ratio | Are the gaps it flagged real? Common miss: flagging a control that context_notes says is already present. |
| 6 | Risk classification | 0–3 | Exact = 3; conservative miss = 2; permissive miss = 1 (worse, because it leaks risk); off by ≥2 or missing = 0. |
| 7 | Assumption surfacing | ratio | Did it surface the material assumptions (the ones whose being wrong changes the answer)? Usually the weakest dimension. |
| 8 | Clarifying questions | 1–5 | Would the questions actually change the assertion if answered? Zero questions on a genuinely ambiguous workflow = 1. |
| 9 | Out-of-scope completeness | 1–5 | Did it declare what it deliberately did not consider, with a clear "why not"? |
| 10 | Verification actionability | 1–5 | Could a reviewer actually run the verification steps and reach a yes/no? Specific source + anchor = 5. |
| 11 | Internal consistency | pass/fail | Does the human Brief say the same thing as the structured JSON (counts, risk level, cited regs)? Any mismatch = fail. |
| 12 | Reproducibility | pass/fail | Same input + same corpus snapshot → same requirements, controls, risk on a re-run. Scored for the system, not per-workflow. |
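
Two of the rubric rules are mechanical enough to sketch in code: the directional risk score (dimension 6) and the written-reason requirement for low anchored scores. The sketch assumes a three-level low / medium / high scale and reads "conservative" as rating the risk higher than the gold; both are assumptions, and the function names are hypothetical.

```typescript
type RiskLevel = "low" | "medium" | "high";
const riskOrder: RiskLevel[] = ["low", "medium", "high"];

// Dimension 6: exact = 3, conservative miss = 2, permissive miss = 1,
// off by two or more levels, or missing = 0.
function riskScore(gold: RiskLevel, candidate: RiskLevel | undefined): 0 | 1 | 2 | 3 {
  if (candidate === undefined) return 0;
  const delta = riskOrder.indexOf(candidate) - riskOrder.indexOf(gold);
  if (delta === 0) return 3;
  if (delta === 1) return 2; // candidate rated the risk one level higher than gold
  if (delta === -1) return 1; // candidate rated the risk one level lower than gold
  return 0;
}

// Anchored 1-5 dimensions: a score below 3 needs a written reason.
// Returns a problem description, or null when the grade entry is acceptable.
function anchoredGradeProblem(score: number, reason?: string): string | null {
  if (Math.floor(score) !== score || score < 1 || score > 5) {
    return "score must be an integer from 1 to 5";
  }
  if (score < 3 && (reason === undefined || reason.trim() === "")) {
    return "scores below 3 need a one-sentence reason";
  }
  return null;
}
```

Because the risk score is asymmetric, the two one-level misses land on different values, which is why the page says it is never mean-averaged.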

03 What gets tested: across models, languages & runs

A single workflow isn't tested once. The same gold reference is the fixed answer key while three things vary: which model produced the output, which language it ran in, and whether the result holds on a repeat run. That's the point of the harness — it's a comparison engine, not a one-shot check.

Multi-model — the bake-off

The harness drives the same workflow through different model configurations — recorded in each run's model_config (a reasoning model + an extractive model, e.g. claude-opus-4-7 + claude-haiku, or the gemini-2.5-flash config used in the latest run). Every output is graded blind, so the per-dimension breakdown becomes an apples-to-apples comparison. The eval's real job: which model config is good enough to ship, dimension by dimension — never a single leaderboard number.

Multi-language — locale-aware

English today, but the set is locale-namespaced from day one (inputs/en/, gold-references/en/), so de / fr / it are additive, never a restructure. Two distinct things to test as languages land: cross-language eval (grade the same workflow in each language) and translation-quality regression (does the German rendering of an English assertion preserve the substance). Runs stratify as runs/<ts>/<locale>/ and reports/<locale>/.
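
A small sketch of why locale namespacing keeps new languages additive: every path is derived from the locale, so adding de, fr, or it only adds sibling directories. The directory names come from the paragraph above; the helper name and the timestamp format in the usage line are hypothetical.

```typescript
type EvalLocale = "en" | "de" | "fr" | "it";

// Directory layout as described above. Nothing here assumes file names
// inside each directory.
function localeDirs(locale: EvalLocale, runTimestamp: string) {
  return {
    inputs: `inputs/${locale}/`,
    goldReferences: `gold-references/${locale}/`,
    run: `runs/${runTimestamp}/${locale}/`,
    reports: `reports/${locale}/`,
  };
}

// Hypothetical timestamp format, for illustration only.
console.log(localeDirs("de", "2026-05-25T00-00-00Z"));
```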

Repeat-run reproducibility

Every workflow is run twice on the same corpus snapshot (dimension 12). "Same" = same requirements, controls, and risk; wording may vary. The run pins corpus.snapshot_sha (the corpus tree SHA, not the repo head), so a corpus edit correctly invalidates a run for reproducibility while a docs-only edit doesn't. This catches unseeded sampling, cache effects, and silent model drift.
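
The reproducibility comparison can be sketched as a field-equality check that ignores ordering and wording. The `RunSummary` shape is hypothetical, and comparing controls by ID is an assumption; the only names taken from the text are the three compared fields and the pinned snapshot SHA.

```typescript
// Hypothetical summary of one run; only the compared fields matter here.
type RunSummary = {
  corpusSnapshotSha: string;
  requirements: string[];
  controls: string[];
  risk: string;
};

// Order-insensitive comparison of two ID lists.
function sameMembers(a: string[], b: string[]): boolean {
  const left = a.slice().sort();
  const right = b.slice().sort();
  return left.length === right.length && left.every((item, i) => item === right[i]);
}

// "invalid" means the two runs used different corpus snapshots,
// so they cannot be compared for reproducibility at all.
function reproducibility(first: RunSummary, second: RunSummary): "pass" | "fail" | "invalid" {
  if (first.corpusSnapshotSha !== second.corpusSnapshotSha) return "invalid";
  const same =
    sameMembers(first.requirements, second.requirements) &&
    sameMembers(first.controls, second.controls) &&
    first.risk === second.risk;
  return same ? "pass" : "fail";
}
```

Free-text fields are deliberately left out of the comparison, since the wording of a re-run may vary without the substance changing.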

04 Case 1: Quality-deviation triage

familiar tier
wf_qd_triage_2026_001 · Domain: GMP / quality systems · AI role: recommendation · Jurisdictions: EU, CH · PIF v0.1

A familiar, well-controlled manufacturing case — the one the team has thought hardest about, so it carries the highest grading confidence. The expert challenge: correctly treat Switzerland as in-scope (Swissmedic aligns with PIC/S Annex 11), and catch change control for model updates as a gap.

Input — the workflow being preflighted

// WorkflowDescription fed to the MCP server
{
  "intent": "Classify incoming quality deviations by severity, suggest root-cause categories from historical patterns, and draft an initial investigation plan that a human quality engineer reviews and approves.",
  "ai_role": "recommendation",
  "data_classes": ["gxp_record", "quality_data", "manufacturing_data"],
  "jurisdictions": ["EU", "CH"],
  "human_gate": "approve_each",
  "reversibility": "reversible",
  "lifecycle_stage": "pilot",
  "risk_tolerance": "low",
  "gxp_domains_self_declared": ["GMP", "quality_systems", "data_integrity"],
  "context_notes": "Pilot: 20 deviations/month, single tablet line, Basel facility. Controls: trained QE reviewer, validated QMS, quarterly trend analysis. LLM output is a structured suggestion only; no action without QE sign-off."
}

Expected output — the gold reference (answer key)

Risk classification: MEDIUM  — GxP touchpoint, but strong controls (approve-each), narrow pilot, reversible. Low is defensible; high is not at this stage.

Applicable requirements (coverage / precision targets)

| Requirement | Source | Strength | Basis |
|---|---|---|---|
| EU-GMP-Annex-11-§4 | EU GMP | Strong | Computerized system influencing GxP decisions; validation expectations apply regardless of the human gate. |
| ICH-Q9-5.1 | ICH | Strong | Risk-management requirements for the quality system cover AI-assisted deviation classification. |
| MHRA-DI-2018-§6.6 | UK MHRA | Strong | ALCOA+ data-integrity expectations for records influenced by computerized systems. |
| Swissmedic-CS-Note-2023 | Swissmedic | Strong | Swiss facility (Basel); Swissmedic aligns with PIC/S Annex 11. Do not treat CH as out-of-scope. |
| EU-GMP-Annex-15 | EU GMP | Moderate | Qualification/validation of the AI system as a computerized system in GMP scope. |

Precision traps (must NOT appear): 21 CFR Part 11 (no US scope) · GDPR (no PII/PHI) · ICH-Q10 (PQS-level, not this workflow).
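
As a worked example of dimensions 1 and 2 on this case, the sketch below scores an invented candidate that misses one gold requirement and falls into one precision trap. The gold IDs are copied from the table above; the candidate list is hypothetical, exact ID matching stands in for the judge-based matching, and scoring an empty list as 1 is a convention chosen for this sketch.

```typescript
// Gold requirement IDs copied from the Case 1 table.
const goldRequirements = [
  "EU-GMP-Annex-11-§4",
  "ICH-Q9-5.1",
  "MHRA-DI-2018-§6.6",
  "Swissmedic-CS-Note-2023",
  "EU-GMP-Annex-15",
];

// Hypothetical candidate: misses Annex 15 and over-cites 21 CFR Part 11.
const candidateRequirements = [
  "EU-GMP-Annex-11-§4",
  "ICH-Q9-5.1",
  "MHRA-DI-2018-§6.6",
  "Swissmedic-CS-Note-2023",
  "21-CFR-Part-11",
];

// Recall = hits / gold size; precision = hits / candidate size.
function coverageAndPrecision(gold: string[], candidate: string[]) {
  const hits = candidate.filter((id) => gold.indexOf(id) !== -1).length;
  return {
    recall: gold.length === 0 ? 1 : hits / gold.length,
    precision: candidate.length === 0 ? 1 : hits / candidate.length,
  };
}

const case1Ratios = coverageAndPrecision(goldRequirements, candidateRequirements);
console.log(case1Ratios); // recall 0.8 (4 of 5 found), precision 0.8 (4 of 5 valid)
```

The two ratios happen to be equal here, but they fail for different reasons: the recall loss is a requirement nobody will see, while the precision loss is a visible extra citation a reviewer can strike.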

Expected missing controls

| Control gap | Criticality | Why |
|---|---|---|
| Documented URS | Required | Annex 11 §4 expects a URS for GxP-impacting computerized systems; none indicated. |
| Risk-based qualification plan | Required | Expected before pilot; not described. |
| AI output attribution in record | Required | ALCOA+ requires distinguishing AI-generated from human-authored content. |
| Audit trail of AI invocations | Required | Prompts/inputs/outputs should be logged for audit and trend review. |
| Change control for model updates | Recommended | A provider model update changes a GxP-impacting system. Models often miss this (sparse training data). |
| Periodic performance review | Recommended | AI suggestion quality should be reviewed against human decisions to detect drift. |

Material assumptions to surface

  • Basel facility holds a Swiss GMP authorization [material]
  • Products intended for EU + Swiss markets [material]
  • LLM via vendor API, not in-house [material]
  • QMS is validated (per notes) [not material]

Expected clarifying questions

  1. Is there an existing URS for this AI workflow?
  2. How is the AI suggestion captured in the QMS record (field / embedded / separate doc)?
  3. Is the LLM via vendor API, on-prem, or third-party integrator?
  4. Has the pilot qualification approach been documented?

Expected out-of-scope declarations

Grader notes
  • Contested call: change control for model updates as Required vs Recommended; either is defensible.
  • What the author wants to see: CH treated as in-scope (Swissmedic ≈ EU GMP).
  • Expected failure modes: missing the model-update control; over-citing 21 CFR Part 11.
  • If a candidate deviates but in a defensible direction, mark for re-review rather than scoring down.
  • Confidence: medium-high

05 Case 2: Post-market-surveillance signal triage

hard tier
wf_pms_triage_2026_002 · Domain: medtech PMS / vigilance · AI role: draft · Device: Class IIa infusion pumps · PIF v0.2

The deliberately hard case — it carries latent privacy and SaMD-classification traps and uses the v0.2 regulatory_domains_self_declared CURIEs. It tests whether the system spots the GDPR trap hidden in "customer complaints" and routes it correctly, and whether it escalates risk to high because a false negative here means a missed adverse event.

Input — the workflow being preflighted

// PIF v0.2 WorkflowDescription (note the CURIE namespaces)
{
  "intent": "Continuously ingest customer complaints, service logs, and public adverse-event databases for our Class IIa infusion pumps. Flag potential new safety signals and draft a weekly trend summary for the PRRC to review.",
  "ai_role": "draft",
  "data_classes": ["safety_data", "quality_data", "clinical_data"],
  "jurisdictions": ["EU", "CH"],
  "output_destination": "advisory",
  "human_gate": "approve_batch",
  "lifecycle_stage": "design",
  "regulatory_domains_self_declared": [
    "eu_mdr:article_83_87_post_market_surveillance_and_vigilance",
    "iso:13485_quality_management_system",
    "ch_swissmedic:mepv_medical_devices_ordinance"
  ],
  "context_notes": "Internal regulatory-affairs use at our Swiss HQ. Does not make diagnostic decisions. PRRC reviews the weekly batched summary before logging any new safety signal in the QMS."
}

Expected output — the gold reference (answer key)

Risk classification: HIGH  — the LLM is a filter; a false negative drops a real safety signal the PRRC then never sees → failure to report under MDR Art. 87. The batch gate is a buffer, not a cure. Medium is only conditionally defensible and shows weaker MDR-vigilance understanding.

Applicable requirements

| Requirement | Source | Strength | Basis |
|---|---|---|---|
| EU-MDR-Art-83 | EU MDR | Strong | Automates collection/review of post-market data, squarely the PMS-system requirement. |
| EU-MDR-Art-87 | EU MDR | Strong | Monitoring adverse-event databases; failure to escalate signals hits the incident-reporting obligation. |
| EU-MDR-PRRC | EU MDR | Strong | PRRC is the named human-in-the-loop; Art. 15 obligations apply to their PMS role. |
| ISO-13485-8.2 | ISO 13485 | Strong | Clause 8.2 mandates a feedback system for early warning of quality problems. |
| Swissmedic-MepV | Swissmedic | Strong | MedDO/MepV applies for Swiss-market PMS and vigilance, mirroring MDR. |

Precision traps (must NOT appear as direct requirements): EU AI Act high-risk (the tool is internal QMS search/summarization, not a safety component of the device — category error) · SaMD / MDR Annex VIII Rule 11 (no per-patient diagnostic/therapeutic info).

Expected missing controls

| Control gap | Criticality | Why |
|---|---|---|
| gxp:ai_output_attribution_in_record | Required | The drafted summary must be marked AI-generated before the PRRC signs off. |
| iso27001:A.9.4.1 (access) | Required | Complaints/logs often carry PHI/PII: access controls + minimization before data hits the LLM. |
| Recall/precision validation plan | Required | The AI is a filter, so a plan must define the acceptable false-negative rate for dropping complaints. |
| gxp:periodic_performance_review | Required | Signal-detection logic must be audited against a human baseline to catch drift. |

Material assumptions to surface

  • Complaints/logs are anonymized of PHI before ingestion [material]
    If not → GDPR (EU) + FADP (CH) fire, needing a DPA with the LLM vendor.
  • LLM is a QMS admin tool, not SaMD [material]
    If SaMD → the tool itself needs CE marking under MDR.

Expected clarifying questions

  1. Are complaints/logs scrubbed of PII/PHI before going to the LLM?
  2. Does the LLM hide data from the PRRC, or pass all raw data through while highlighting trends? (false-negative risk)
  3. Is the LLM hosted inside the company boundary, or a third-party cloud API?

Expected out-of-scope declarations

What this case tests
  1. The semantic routing trap — does it spot the GDPR/privacy trap in "customer complaints / service logs", surface it as an assumption, and route GDPR citations OUT of applicable_requirements and INTO a conditional recommendation?
  2. The new CURIEs — does it cleanly process the v0.2 eu_mdr: and iso: namespaces?
  3. Risk calibration — does it recognize a false negative = missed adverse event, elevating to HIGH despite the human gate?
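
The first test above (semantic routing) is close to a mechanical check, so a rough sketch may help when grading it. Only `applicable_requirements` is a field name from this page; `assumptions`, `conditional_recommendations`, and the keyword matching are hypothetical simplifications, not the harness's actual behavioral check.

```typescript
// Hypothetical slice of an assertion; only `applicable_requirements`
// is a field name taken from this page.
type RoutedAssertion = {
  applicable_requirements: { id: string }[];
  assumptions: string[];
  conditional_recommendations: { id: string; condition: string }[];
};

// Crude keyword test standing in for a semantic match.
function mentionsPrivacy(text: string): boolean {
  return /\b(gdpr|fadp|pii|phi)\b/i.test(text);
}

// Routing is correct when GDPR is NOT a direct requirement, the privacy
// question IS surfaced as an assumption, and GDPR appears conditionally.
function gdprRoutingOk(assertion: RoutedAssertion): boolean {
  const notDirect = assertion.applicable_requirements.every((r) => !/gdpr/i.test(r.id));
  const surfaced = assertion.assumptions.some((text) => mentionsPrivacy(text));
  const conditional = assertion.conditional_recommendations.some((r) => /gdpr/i.test(r.id));
  return notDirect && surfaced && conditional;
}
```

All three conditions have to hold together: dropping GDPR entirely is as much a routing failure as citing it as a direct requirement.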

06 Latest run: gold-set-2 verdict

Capstone run 2026-05-25 (rev 4, post-rebase on main + medtech corpus from PR #35). Harness 0.2.0 · corpus snapshot of 16 sources · run config Gemini 2.5 Flash.

| Workflow | PIF conformance | Swissmedic retrieved | behavioral.gdpr-routing | Verdict |
|---|---|---|---|---|
| wf_pms_triage_2026_002 (hard, v0.2) | PASS | Yes (4 MepV requirements) | fail (corpus-blocked, correct) | COMPLETE PASS |
| wf_qd_triage_2026_001 (familiar, v0.1) | PASS | N/A | n/a | CLEAN PASS |

Both pass PIF schema validation with zero pif_violations. Swissmedic-MepV retrieves end-to-end (corpus + CURIE input both present after #35). GDPR routing correctly fires as fail — corpus-blocked, no GDPR source, working as designed. Pharma regression clean.

07 Running an eval & where things live

Two commands

# 1 — produce a run (drives the server, gates on PIF)
npm run eval -- --workflow wf_qd_triage_2026_001

# 2 — grade blind, then aggregate to a report
cp evals/harness/_grades-template.yaml \
   evals/runs/<run-id>/grades.yaml
# …fill scores against outputs-blinded…
npm run eval:report -- --run <run-id>

Exit codes: 0 success · 1 runtime error · 2 PIF conformance failed (manifest captures the violation list).

Directory map

  • rubric.md — the 12 dimensions + anchors
  • thresholds.yaml — Gate 5.5 launch thresholds
  • gold-references/en/ — the answer keys (locale-namespaced)
  • inputs/en/ — WorkflowDescription JSONs
  • runs/ — timestamped outputs (gitignored)
  • reports/ — committed run verdicts
  • harness/ — runner code (Zod-validated manifests)

Eval-set composition (target)

~60% familiar workflows (highest grading confidence), ~25% adjacent domains (generalization), ~15% intentionally hard cases (honesty / refusal). Track each workflow's category so scores disaggregate: "85% familiar / 60% adjacent / 40% hard" tells you far more than one "75% overall".
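
Disaggregation by tier can be sketched as a simple group-and-average over per-workflow scores. The row shape, the helper name, and every score in the usage lines are hypothetical; the sketch handles one dimension (recall) to keep it short, and a tier with no workflows reports null rather than a misleading zero.

```typescript
type Tier = "familiar" | "adjacent" | "hard";
type WorkflowScore = { workflow: string; tier: Tier; recall: number };

// Mean recall per tier. A tier with no workflows yet stays null.
function meanRecallByTier(rows: WorkflowScore[]): Record<Tier, number | null> {
  const tiers: Tier[] = ["familiar", "adjacent", "hard"];
  const result: Record<Tier, number | null> = { familiar: null, adjacent: null, hard: null };
  for (const tier of tiers) {
    const inTier = rows.filter((row) => row.tier === tier);
    if (inTier.length > 0) {
      result[tier] = inTier.reduce((sum, row) => sum + row.recall, 0) / inTier.length;
    }
  }
  return result;
}

// Invented scores for placeholder workflows, for illustration only.
const exampleRows: WorkflowScore[] = [
  { workflow: "wf_example_a", tier: "familiar", recall: 1.0 },
  { workflow: "wf_example_b", tier: "familiar", recall: 0.75 },
  { workflow: "wf_example_c", tier: "hard", recall: 0.5 },
];
console.log(meanRecallByTier(exampleRows));
```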