Eval gold set · harness v0.2.0 · English

The two gold-set cases, end‑to‑end

How Preclari measures whether the server's regulatory preflight is actually correct, not just well-formed. This page walks through the framework, the 12-dimension rubric, and the two expert-authored cases that make up the current gold set.

Ani — start here. Two ideas do all the work. (1) Every output is checked on two independent layers: is it valid PIF (structure), and is it right against an expert gold reference (substance). (2) Substance is scored blind, one dimension at a time, never collapsed into a single number — the per-dimension breakdown is the whole point. Read Case 1 first (the familiar one), then Case 2 (the hard one with the privacy trap). The gold references are the answer keys you'll be grading against.

What we're asking you to own

Ani — take the lead on evals & coverage

You're the owner of two questions: is the system actually correct, and how do we know. That means running the eval loop end-to-end and growing what it covers — not just scoring what exists today.

Own now — lead

The loop: run → grade blind → aggregate → report, against the rubric below.
Coverage: grow the gold set with breadth and rigour — toward 15–20 workflows at 60% familiar / 25% adjacent / 15% hard, every one categorized so scores disaggregate.
Integrity: keep gold references reconciled with the live corpus as it changes; track each dimension's score over time to see what's getting better or worse.

As the team grows

When a regulatory SME joins, they take over gold-reference authorship — the domain judgment of what's correct. You then go full-time on the eval system itself: harness rigour, the LLM-judge consensus automation, coverage breadth, and regression as the corpus and product move under you.

First tasks

(1) Run both cases below to learn the harness hands-on.
(2) Reconcile the gold references against the corrected corpus. Start with the Swissmedic-CS-Note citation in Case 1, which we've since established is a phantom (the real reference is PIC/S PI 011-3).
(3) Propose the next 3–5 workflows to grow coverage across the tiers.

01 Two layers: both must pass

PIF conformance and the eval rubric answer different questions. Neither substitutes for the other; a release needs both.

PIF conformance — structure

Is this output a valid PIF document? A schema check against PreflightAssertion. Deterministic, fast, automatable. Runs as a hard gate before any grading — if the assertion doesn't validate, the run halts (exit code 2) and nothing gets graded.
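
A minimal sketch of the hard gate described above, assuming a validator that returns a list of violation strings. The function names, the `GateResult` shape, and the `risk_level` field are hypothetical stand-ins for illustration; the real check is the schema for PreflightAssertion. Only the exit codes (0 and 2) come from this page.

```typescript
// Hypothetical sketch of the conformance gate; not the harness's actual code.
interface GateResult {
  exitCode: 0 | 2;      // 0 = proceed to grading, 2 = PIF conformance failed
  violations: string[]; // what the manifest would capture on failure
}

// Stand-in validator: the required field names are illustrative only.
function validateAssertion(candidate: unknown): string[] {
  if (typeof candidate !== "object" || candidate === null) {
    return ["assertion is not an object"];
  }
  const record = candidate as Record<string, unknown>;
  const violations: string[] = [];
  for (const field of ["applicable_requirements", "risk_level"]) {
    if (!(field in record)) violations.push(`missing field: ${field}`);
  }
  return violations;
}

// Structure is checked first; grading only proceeds on exit code 0.
function conformanceGate(candidate: unknown): GateResult {
  const violations = validateAssertion(candidate);
  return { exitCode: violations.length === 0 ? 0 : 2, violations };
}
```

The point of the shape is ordering: nothing downstream sees an output until the structural check has returned an empty violation list.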

Eval rubric — substance

Is the content correct against expert judgment? Scored across 12 dimensions, partly subjective, slower, needs a gold reference and a grader. A PIF-valid output can still name the wrong regulations; a substance-correct output can still be malformed. Independent checks.

How grading works

Blind by construction

Each run is written twice: outputs-raw (model identity intact) and outputs-blinded (the produced_by model fields stripped). The grader opens only the blinded copy, so knowing which model produced an output can't bias the score. Reconciliation against raw happens only after all scores are committed.
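
The blinding step can be pictured as a recursive key filter. This is a sketch under the assumption that model identity lives only in keys named `produced_by`; the `Json` type and the `blind` function are illustrative names, not the harness's API.

```typescript
// Hypothetical sketch: recursively drop every `produced_by` key from a run output.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function blind(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => blind(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      if (key === "produced_by") continue; // strip model identity
      out[key] = blind(child);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

The function builds a new object rather than mutating its input, so the raw copy stays intact for the later reconciliation step.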

Hybrid grading

Human — applicability reasoning, risk, clarifying Qs, out-of-scope, verification.
Deterministic — internal consistency, reproducibility (field equality).
LLM-as-judge + consensus — coverage/precision dims: Claude Opus and Gemini Pro match-lists are intersected; disagreements escalate.
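
As a rough illustration of the consensus step, the sketch below intersects two judges' match-lists and collects everything only one judge matched for escalation. The `Consensus` shape and the function name are assumptions made for this example.

```typescript
// Hypothetical sketch of two-judge consensus over matched requirement IDs.
interface Consensus {
  agreed: string[];    // matched by both judges
  escalated: string[]; // matched by exactly one judge; goes to a human
}

function consensus(judgeA: string[], judgeB: string[]): Consensus {
  const a = new Set(judgeA);
  const b = new Set(judgeB);
  const agreed = Array.from(a).filter((id) => b.has(id));
  const escalated = [
    ...Array.from(a).filter((id) => !b.has(id)),
    ...Array.from(b).filter((id) => !a.has(id)),
  ];
  return { agreed, escalated };
}
```

Intersection is the conservative choice: a match counts only when both judges found it independently, and every disagreement is surfaced rather than silently resolved.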

Score types: ratio 0.0–1.0 (recall/precision) · anchored 1–5 (reasoning, questions, scope, verification) · directional 0–3 (risk — asymmetric, never mean-averaged) · binary pass/fail (consistency, reproducibility). The aggregator never collapses to a composite — a release blocks if any gated dimension misses its threshold (Gate 5.5).
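
One way to picture "never collapses to a composite" is a gate that returns the list of failing dimensions instead of a number. The threshold values, dimension keys, and the `Threshold` shape below are hypothetical; the real values are the Gate 5.5 thresholds.

```typescript
// Hypothetical sketch: per-dimension gating with no composite score.
type Score = number | "pass" | "fail";

interface Threshold {
  min?: number;       // minimum for numeric dimensions (illustrative)
  mustPass?: boolean; // true for binary dimensions
}

function blockedDimensions(
  scores: Record<string, Score>,
  thresholds: Record<string, Threshold>,
): string[] {
  const blocked: string[] = [];
  for (const [dim, t] of Object.entries(thresholds)) {
    const score = scores[dim];
    if (score === undefined) {
      blocked.push(dim); // an ungraded gated dimension also blocks
    } else if (typeof score === "number") {
      if (t.min !== undefined && score < t.min) blocked.push(dim);
    } else if (t.mustPass && score !== "pass") {
      blocked.push(dim);
    }
  }
  return blocked; // a release ships only when this list is empty
}
```

Because the return value is a list of names, a strong recall score can never offset a weak risk score; each gated dimension stands or falls alone.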

02 The 12-dimension rubric (v0.1)

Each output is graded one dimension at a time against the gold reference. A score below 3 (on the 1–5 dims) requires a written one-sentence reason — it forces real examination.

#	Dimension	Type	What it measures
1	Requirements coverage (recall)	ratio	Of the gold's applicable requirements, how many did the candidate find? Missing a requirement is the most dangerous failure — invisible to a reviewer.
2	Requirements precision	ratio	Of the requirements the candidate named, how many are actually valid? Over-citing (e.g. 21 CFR Part 11 with no US scope) costs here.
3	Applicability reasoning	1–5	Is the `applicability_basis` chain sound? Score the reasoning, not the conclusion — right answer / bad chain is a 1–2.
4	Missing-controls coverage	ratio	Did it find the control gaps the gold lists? Semantic match, not string match.
5	Missing-controls precision	ratio	Are the gaps it flagged real? Common miss: flagging a control that `context_notes` says is already present.
6	Risk classification	0–3	Exact = 3; conservative miss = 2; permissive miss = 1 (worse — leaks risk); off by ≥2 or missing = 0.
7	Assumption surfacing	ratio	Did it surface the material assumptions (the ones whose being wrong changes the answer)? Usually the weakest dimension.
8	Clarifying questions	1–5	Would the questions actually change the assertion if answered? Zero questions on a genuinely ambiguous workflow = 1.
9	Out-of-scope completeness	1–5	Did it declare what it deliberately did not consider, with a clear "why not"?
10	Verification actionability	1–5	Could a reviewer actually run the verification steps and reach a yes/no? Specific source + anchor = 5.
11	Internal consistency	pass/fail	Does the human Brief say the same thing as the structured JSON (counts, risk level, cited regs)? Any mismatch = fail.
12	Reproducibility	pass/fail	Same input + same corpus snapshot → same requirements, controls, risk on a re-run. Scored for the system, not per-workflow.
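
The directional scoring of dimension 6 can be sketched as follows, assuming a three-level ordinal scale (low < medium < high). The scale and the names `LEVELS` and `scoreRisk` are assumptions for illustration; the 3/2/1/0 mapping follows the table row above.

```typescript
// Hypothetical sketch of dimension 6, assuming a three-level ordinal scale.
const LEVELS = ["low", "medium", "high"] as const;
type RiskLevel = (typeof LEVELS)[number];

function scoreRisk(gold: RiskLevel, candidate: RiskLevel | undefined): 0 | 1 | 2 | 3 {
  if (candidate === undefined) return 0; // missing classification
  const delta = LEVELS.indexOf(candidate) - LEVELS.indexOf(gold);
  if (delta === 0) return 3;             // exact match
  if (Math.abs(delta) >= 2) return 0;    // off by two or more levels
  return delta > 0 ? 2 : 1;              // conservative miss : permissive miss
}
```

The asymmetry is the reason this dimension is not mean-averaged: a 2 and a 1 are different kinds of error, not adjacent points on one quality scale.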

03 What gets tested: across models, languages & runs

A single workflow isn't tested once. The same gold reference is the fixed answer key while three things vary: which model produced the output, which language it ran in, and whether the result holds on a repeat run. That's the point of the harness — it's a comparison engine, not a one-shot check.

Multi-model — the bake-off

The harness drives the same workflow through different model configurations — recorded in each run's model_config (a reasoning model + an extractive model, e.g. claude-opus-4-7 + claude-haiku, or the gemini-2.5-flash config used in the latest run). Every output is graded blind, so the per-dimension breakdown becomes an apples-to-apples comparison. The eval's real job: which model config is good enough to ship, dimension by dimension — never a single leaderboard number.

Multi-language — locale-aware

English today, but the set is locale-namespaced from day one (inputs/en/, gold-references/en/), so de / fr / it are additive, never a restructure. Two distinct things to test as languages land: cross-language eval (grade the same workflow in each language) and translation-quality regression (does the German rendering of an English assertion preserve the substance). Runs stratify as runs/<ts>/<locale>/ and reports/<locale>/.

Repeat-run reproducibility

Every workflow is run twice on the same corpus snapshot (dimension 12). "Same" = same requirements, controls, and risk; wording may vary. The run pins corpus.snapshot_sha (the corpus tree SHA, not the repo head), so a corpus edit correctly invalidates a run for reproducibility while a docs-only edit doesn't. This catches unseeded sampling, cache effects, and silent model drift.
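
A sketch of the repeat-run comparison, assuming each run can be reduced to sets of requirement and control IDs plus a risk level. The `RunSummary` shape and its field names are illustrative, not the harness's manifest format.

```typescript
// Hypothetical sketch of the dimension-12 comparison; field names are illustrative.
interface RunSummary {
  snapshotSha: string;    // stands in for the pinned corpus.snapshot_sha
  requirements: string[]; // requirement IDs
  controls: string[];     // missing-control IDs
  risk: string;
}

// Order-insensitive comparison: wording and ordering may vary between runs.
function sameSet(a: string[], b: string[]): boolean {
  const left = new Set(a);
  const right = new Set(b);
  return left.size === right.size && Array.from(left).every((x) => right.has(x));
}

function reproducible(first: RunSummary, second: RunSummary): "pass" | "fail" | "invalid" {
  // Runs on different corpus snapshots cannot be compared for reproducibility.
  if (first.snapshotSha !== second.snapshotSha) return "invalid";
  const same =
    sameSet(first.requirements, second.requirements) &&
    sameSet(first.controls, second.controls) &&
    first.risk === second.risk;
  return same ? "pass" : "fail";
}
```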

04 Case 1: Quality-deviation triage

familiar tier

wf_qd_triage_2026_001 · Domain: GMP / quality systems · AI role: recommendation · Jurisdictions: EU, CH · PIF v0.1

A familiar, well-controlled manufacturing case — the one the team has thought hardest about, so it carries the highest grading confidence. The expert challenge: correctly treat Switzerland as in-scope (Swissmedic aligns with PIC/S Annex 11), and catch change control for model updates as a gap.

Input — the workflow being preflighted

// WorkflowDescription fed to the MCP server
{
  "intent": "Classify incoming quality deviations by severity, suggest root-cause categories from historical patterns, and draft an initial investigation plan that a human quality engineer reviews and approves.",
  "ai_role": "recommendation",
  "data_classes": ["gxp_record", "quality_data", "manufacturing_data"],
  "jurisdictions": ["EU", "CH"],
  "human_gate": "approve_each",  "reversibility": "reversible",
  "lifecycle_stage": "pilot",    "risk_tolerance": "low",
  "gxp_domains_self_declared": ["GMP", "quality_systems", "data_integrity"],
  "context_notes": "Pilot: 20 deviations/month, single tablet line, Basel facility. Controls: trained QE reviewer, validated QMS, quarterly trend analysis. LLM output is a structured suggestion only; no action without QE sign-off."
}

Expected output — the gold reference (answer key)

Risk classification: MEDIUM — GxP touchpoint, but strong controls (approve-each), narrow pilot, reversible. Low is defensible; high is not at this stage.

Applicable requirements (coverage / precision targets)

Requirement	Source	Strength	Basis
EU-GMP-Annex-11-§4	EU GMP	Strong	Computerized system influencing GxP decisions — validation expectations apply regardless of the human gate.
ICH-Q9-5.1	ICH	Strong	Risk-management requirements for the quality system cover AI-assisted deviation classification.
MHRA-DI-2018-§6.6	UK MHRA	Strong	ALCOA+ data-integrity expectations for records influenced by computerized systems.
Swissmedic-CS-Note-2023	Swissmedic	Strong	Swiss facility (Basel); Swissmedic aligns with PIC/S Annex 11. Do not treat CH as out-of-scope.
EU-GMP-Annex-15	EU GMP	Moderate	Qualification/validation of the AI system as a computerized system in GMP scope.

Precision traps (must NOT appear): 21 CFR Part 11 (no US scope) · GDPR (no PII/PHI) · ICH-Q10 (PQS-level, not this workflow).
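
To make dimensions 1 and 2 concrete, here is a worked sketch against the five requirement IDs in the table above, copied as written. It uses exact ID matching for simplicity, whereas this page describes judge consensus for the coverage and precision dimensions. The candidate list and the `21-CFR-Part-11` identifier are invented for the example.

```typescript
// Hypothetical sketch of recall and precision using exact ID matching.
function recallPrecision(goldIds: string[], candidateIds: string[]) {
  const goldSet = new Set(goldIds);
  const found = new Set(candidateIds);
  const hits = Array.from(found).filter((id) => goldSet.has(id)).length;
  return {
    recall: goldSet.size === 0 ? 1 : hits / goldSet.size,
    precision: found.size === 0 ? 0 : hits / found.size,
  };
}

const gold = [
  "EU-GMP-Annex-11-§4", "ICH-Q9-5.1", "MHRA-DI-2018-§6.6",
  "Swissmedic-CS-Note-2023", "EU-GMP-Annex-15",
];
// Illustrative candidate: misses Annex 15 and over-cites 21 CFR Part 11.
const candidate = [
  "EU-GMP-Annex-11-§4", "ICH-Q9-5.1", "MHRA-DI-2018-§6.6",
  "Swissmedic-CS-Note-2023", "21-CFR-Part-11",
];
const result = recallPrecision(gold, candidate);
// recall = 4/5 (one gold requirement missed); precision = 4/5 (one trap cited)
```

The two numbers happen to coincide here but fail for different reasons, which is why they are reported as separate dimensions.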

Expected missing controls

Control gap	Criticality	Why
Documented URS	Required	Annex 11 §4 expects a URS for GxP-impacting computerized systems; none indicated.
Risk-based qualification plan	Required	Expected before pilot; not described.
AI output attribution in record	Required	ALCOA+ requires distinguishing AI-generated from human-authored content.
Audit trail of AI invocations	Required	Prompts/inputs/outputs should be logged for audit and trend review.
Change control for model updates	Recommended	A provider model update changes a GxP-impacting system. Models often miss this — sparse training data.
Periodic performance review	Recommended	AI suggestion quality should be reviewed against human decisions to detect drift.

Material assumptions to surface

Basel facility holds a Swiss GMP authorization · material
Products intended for EU + Swiss markets · material
LLM via vendor API, not in-house · material
QMS is validated (per notes) · not material

Expected clarifying questions

Is there an existing URS for this AI workflow?
How is the AI suggestion captured in the QMS record (field / embedded / separate doc)?
Is the LLM via vendor API, on-prem, or third-party integrator?
Has the pilot qualification approach been documented?

Expected out-of-scope declarations

FDA 21 CFR Part 11 / US GMP — no US jurisdiction
Pharmacovigilance signal detection — this is manufacturing QD, not adverse events
GDPR data-subject rights — no PII/PHI in data classes
Production-scale deployment — preflight covers the pilot only
Cybersecurity / IT general controls — outside preflight scope

Grader notes
Contested call: change control for model updates as Required vs Recommended; either is defensible.
What the author wants to see: CH treated as in-scope (Swissmedic ≈ EU GMP).
Expected failure modes: missing the model-update control; over-citing 21 CFR Part 11.
If a candidate deviates but in a defensible direction, mark for re-review rather than scoring down.
Confidence: medium-high

05 Case 2: Post-market-surveillance signal triage

hard tier

wf_pms_triage_2026_002 · Domain: medtech PMS / vigilance · AI role: draft · Device: Class IIa infusion pumps · PIF v0.2

The deliberately hard case — it carries latent privacy and SaMD-classification traps and uses the v0.2 regulatory_domains_self_declared CURIEs. It tests whether the system spots the GDPR trap hidden in "customer complaints" and routes it correctly, and whether it escalates risk to high because a false negative here means a missed adverse event.

Input — the workflow being preflighted

// PIF v0.2 WorkflowDescription (note the CURIE namespaces)
{
  "intent": "Continuously ingest customer complaints, service logs, and public adverse-event databases for our Class IIa infusion pumps. Flag potential new safety signals and draft a weekly trend summary for the PRRC to review.",
  "ai_role": "draft",
  "data_classes": ["safety_data", "quality_data", "clinical_data"],
  "jurisdictions": ["EU", "CH"],
  "output_destination": "advisory",  "human_gate": "approve_batch",
  "lifecycle_stage": "design",
  "regulatory_domains_self_declared": [
     "eu_mdr:article_83_87_post_market_surveillance_and_vigilance",
     "iso:13485_quality_management_system",
     "ch_swissmedic:mepv_medical_devices_ordinance"
  ],
  "context_notes": "Internal regulatory-affairs use at our Swiss HQ. Does not make diagnostic decisions. PRRC reviews the weekly batched summary before logging any new safety signal in the QMS."
}

Expected output — the gold reference (answer key)

Risk classification: HIGH — the LLM is a filter; a false negative drops a real safety signal the PRRC then never sees → failure to report under MDR Art. 87. The batch gate is a buffer, not a cure. Medium is only conditionally defensible and shows weaker MDR-vigilance understanding.

Applicable requirements

Requirement	Source	Strength	Basis
EU-MDR-Art-83	EU MDR	Strong	Automates collection/review of post-market data — squarely the PMS-system requirement.
EU-MDR-Art-87	EU MDR	Strong	Monitoring adverse-event databases; failure to escalate signals hits the incident-reporting obligation.
EU-MDR-PRRC	EU MDR	Strong	PRRC is the named human-in-the-loop; Art. 15 obligations apply to their PMS role.
ISO-13485-8.2	ISO 13485	Strong	Clause 8.2 mandates a feedback system for early warning of quality problems.
Swissmedic-MepV	Swissmedic	Strong	MedDO/MepV applies for Swiss-market PMS and vigilance, mirroring MDR.

Precision traps (must NOT appear as direct requirements): EU AI Act high-risk (the tool is internal QMS search/summarization, not a safety component of the device — category error) · SaMD / MDR Annex VIII Rule 11 (no per-patient diagnostic/therapeutic info).

Expected missing controls

Control gap	Criticality	Why
gxp:ai_output_attribution_in_record	Required	The drafted summary must be marked AI-generated before the PRRC signs off.
iso27001:A.9.4.1 (access)	Required	Complaints/logs often carry PHI/PII — access controls + minimization before data hits the LLM.
Recall/precision validation plan	Required	The AI is a filter — a plan must define the acceptable false-negative rate for dropping complaints.
gxp:periodic_performance_review	Required	Signal-detection logic must be audited against a human baseline to catch drift.

Material assumptions to surface

Complaints/logs are stripped of PHI before ingestion · material
If not → GDPR (EU) + FADP (CH) apply, requiring a DPA with the LLM vendor.
LLM is a QMS admin tool, not SaMD · material
If SaMD → the tool itself needs CE marking under MDR.

Expected clarifying questions

Are complaints/logs scrubbed of PII/PHI before going to the LLM?
Does the LLM hide data from the PRRC, or pass all raw data through while highlighting trends? (false-negative risk)
Is the LLM hosted inside the company boundary, or a third-party cloud API?

Expected out-of-scope declarations

GDPR / Swiss FADP — out of scope pending clarification on whether complaints carry unredacted patient data. (The exact test of the conditional-assumption router.)
US FDA 21 CFR 820 / 803 — US jurisdiction not declared.
MDR conformity assessment of the infusion pump — preflight assesses the AI tool, not the physical device.

What this case tests

The semantic routing trap — does it spot the GDPR/privacy trap in "customer complaints / service logs", surface it as an assumption, and route GDPR citations OUT of applicable_requirements and INTO a conditional recommendation?
The new CURIEs: does it cleanly process the v0.2 eu_mdr:, iso:, and ch_swissmedic: namespaces?
Risk calibration — does it recognize a false negative = missed adverse event, elevating to HIGH despite the human gate?
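
The routing trap in the first item can be expressed as a small behavioral check. This is a sketch only: `applicable_requirements` is the field named above, while `conditional_recommendations`, the `AssertionLike` shape, and the ID pattern are assumptions made for illustration.

```typescript
// Hypothetical sketch of the GDPR routing check; field shapes are assumed.
interface AssertionLike {
  applicable_requirements: { id: string }[];
  conditional_recommendations: { id: string; condition: string }[];
}

// Treat any GDPR or FADP identifier as a privacy citation (illustrative pattern).
const isPrivacyCitation = (id: string): boolean => /gdpr|fadp/i.test(id);

function privacyRoutedCorrectly(assertion: AssertionLike): boolean {
  const leaked = assertion.applicable_requirements.some((r) => isPrivacyCitation(r.id));
  const routed = assertion.conditional_recommendations.some((r) => isPrivacyCitation(r.id));
  // Correct routing: absent from direct requirements, present as a conditional.
  return !leaked && routed;
}
```

Both halves matter: dropping the privacy citation entirely hides the trap, while citing it as a direct requirement asserts something the input does not establish.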

06 Latest run: gold-set-2 verdict

Capstone run 2026-05-25 (rev 4, post-rebase on main + medtech corpus from PR #35). Harness 0.2.0 · corpus snapshot of 16 sources · run config Gemini 2.5 Flash.

Workflow	PIF conformance	Swissmedic retrieved	behavioral.gdpr-routing	Verdict
wf_pms_triage_2026_002 (hard, v0.2)	PASS	Yes — 4 MepV requirements	fail (corpus-blocked, correct)	COMPLETE PASS
wf_qd_triage_2026_001 (familiar, v0.1)	PASS	N/A	n/a	CLEAN PASS

Both pass PIF schema validation with zero pif_violations. Swissmedic-MepV retrieves end-to-end (corpus + CURIE input both present after #35). The behavioral.gdpr-routing check reports fail, and that is the expected result: the corpus has no GDPR source, so the check is corpus-blocked and working as designed. Pharma regression clean.

07 Setup, running an eval & where things live

Setup · one-time

# clone + Node 22 + install
git clone https://github.com/cfpramod/preclari-mcp.git
cd preclari-mcp
nvm install 22
npm install

# your Gemini AI Studio key (all three matter)
export LLM_PROVIDER=gemini
export GOOGLE_GENAI_USE_VERTEXAI=false   # AI Studio (your key), not Vertex
export GEMINI_API_KEY=<your-aistudio-key>

The engine defaults to Google Vertex (which needs a GCP project you do not have). Set GOOGLE_GENAI_USE_VERTEXAI=false to use your AI Studio key instead. Default model is gemini-2.5-flash, so leave LLM_MODEL unset. Get a key at aistudio.google.com (Get API key); never commit it.

Run · two commands

# 1 — produce a run (drives the server, gates on PIF)
npm run eval -- --workflow wf_qd_triage_2026_001

# 2 — grade blind, then aggregate to a report
cp evals/harness/_grades-template.yaml \
   evals/runs/<run-id>/grades.yaml
# …fill scores against outputs-blinded…
npm run eval:report -- --run <run-id>

Exit codes: 0 success · 1 runtime error · 2 PIF conformance failed (manifest captures the violation list).

Directory map

rubric.md — the 12 dimensions + anchors
thresholds.yaml — Gate 5.5 launch thresholds
gold-references/en/ — the answer keys (locale-namespaced)
inputs/en/ — WorkflowDescription JSONs
runs/ — timestamped outputs (gitignored)
reports/ — committed run verdicts
harness/ — runner code (Zod-validated manifests)

Your task · authoring eval sets

An eval set is an input (a workflow scenario) plus a gold reference (the expert answer key). Both live under a locale dir (en/). The gold reference is the reg-affairs judgment the engine is graded against, so it is your core work.

Create a new eval set

# 1 - input: copy a scenario, then edit it
cp evals/inputs/en/wf_qd_triage_2026_001.json \
   evals/inputs/en/wf_qd_triage_2026_003.json

# 2 - gold reference: your expert judgment
cp evals/gold-references/en/_template.md \
   evals/gold-references/en/wf_qd_triage_2026_003.md

# 3 - run it, read the score, iterate
npm run eval -- --workflow wf_qd_triage_2026_003

ID convention

wf_<category>_<YYYY>_<NNN> (category from _template.md). Fill in the applicable requirements with citations, the controls to flag as missing, the risk class, the assumptions, and the verification steps.
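
A quick way to catch a malformed ID before opening a PR is a pattern check. The sketch below assumes lowercase categories that may contain underscores and a three-digit sequence, as in the two existing IDs; the actual category list lives in _template.md and is not checked here. `parseWorkflowId` is a hypothetical helper, not part of the harness.

```typescript
// Sketch of an ID check for wf_<category>_<YYYY>_<NNN>.
// Assumes lowercase categories (underscores allowed) and a 3-digit sequence.
const WORKFLOW_ID = /^wf_([a-z][a-z0-9_]*)_(\d{4})_(\d{3})$/;

interface WorkflowId {
  category: string;
  year: number;
  sequence: number;
}

function parseWorkflowId(id: string): WorkflowId | null {
  const match = WORKFLOW_ID.exec(id);
  if (match === null) return null; // does not follow the convention
  return {
    category: match[1],
    year: Number(match[2]),
    sequence: Number(match[3]),
  };
}
```

Parsing out the category also gives the per-tier disaggregation something to key on, provided each category is mapped to a tier elsewhere.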

Contribute via PR

main is ruleset-protected (no direct push). Add your sets on a branch and open a PR for Pramod to review and merge.

git checkout -b ani/evals-qd-2026-003
# stage only your files (never git add -A: tracked node_modules symlink)
git add evals/inputs/en/wf_qd_triage_2026_003.json \
        evals/gold-references/en/wf_qd_triage_2026_003.md
git commit -m "evals: add qd_triage 2026_003 set"
git push -u origin ani/evals-qd-2026-003
gh pr create   # Pramod reviews + merges

Eval-set composition (target)

~60% familiar workflows (highest grading confidence), ~25% adjacent domains (generalization), ~15% intentionally hard cases (honesty / refusal). Track each workflow's category so scores disaggregate: "85% familiar / 60% adjacent / 40% hard" tells you far more than one "75% overall".
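
For planning purposes, the target mix can be turned into whole-workflow counts. The sketch below uses the largest-remainder method, which is one reasonable rounding choice and not something this page prescribes; `allocate` and `TARGET_PERCENT` are illustrative names.

```typescript
// Sketch: split a gold-set size across tiers with the largest-remainder method.
const TARGET_PERCENT = { familiar: 60, adjacent: 25, hard: 15 } as const;
type Tier = keyof typeof TARGET_PERCENT;

function allocate(total: number): Record<Tier, number> {
  const tiers = Object.keys(TARGET_PERCENT) as Tier[];
  const counts = { familiar: 0, adjacent: 0, hard: 0 };
  // Integer arithmetic avoids floating-point drift: floor(total * pct / 100).
  for (const t of tiers) counts[t] = Math.floor((total * TARGET_PERCENT[t]) / 100);
  let left = total - tiers.reduce((sum, t) => sum + counts[t], 0);
  // Hand leftover slots to the tiers with the largest fractional remainder.
  const byRemainder = tiers.slice().sort(
    (a, b) => ((total * TARGET_PERCENT[b]) % 100) - ((total * TARGET_PERCENT[a]) % 100),
  );
  for (const t of byRemainder) {
    if (left === 0) break;
    counts[t] += 1;
    left -= 1;
  }
  return counts;
}
```

For a set of 20 workflows this gives 12 familiar, 5 adjacent, and 3 hard; for 15 it gives 9, 4, and 2.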