The context
At Generate Biomedicines, I was part of a cross-functional team evaluating targets for an ADC partnership. My team handled target biology: mechanism of action, expression patterns, scientific feasibility. While we understood the clinical, commercial, and regulatory landscapes at a high level, we relied on specialized teams to provide the depth of analysis these domains required.
Each group worked in parallel for weeks before reconvening. Scheduling was difficult, and when we finally met, a significant portion of the time went to walking colleagues through target biology concepts, explaining the mechanism of action and why it mattered for the partnership. We often ran out of time before reaching decisions.
The question
Can domain-specialized agents, working in concert, produce analysis trustworthy enough that teams would use it to propose or kill targets within days rather than weeks?
The goal is not to replace expertise. It is to provide synthesized biology, competitive intelligence, financial context, and regulatory landscape so meetings become decision-making sessions rather than information-sharing sessions.
Every figure, claim, and source must be traceable and correct before any function will act on it.
The hypothesis
Precision is a function of specialization, not scale.
A single agent covering biology, clinical, financial, regulatory, and IP analysis would stretch its context window thin and lose depth. If each agent focused on one domain with tailored prompts, relevant context, and appropriate model selection, outputs would be more reliable and verifiable.
The body doesn't deploy one universal immune cell against every pathogen. Rather, it coordinates dozens of specialized immune cell types, each recognizing and eliminating specific antigens through distinct mechanisms. A generalist LLM trained on the entire internet filters through massive noise to find signal.
But specialization alone isn't enough. Every claim needs a verifiable source. I designed for an audit trail that allows verification within 30 seconds.
This led to deliberate model selection:
- Claude (Biology, Clinical, Regulatory): Constitutional AI training produces strict formatting compliance and conservative interpretation. The 200K-token context window ingests full study protocols.
- Perplexity (Patent, Market Research): A reasoning engine over live web and patent databases. Bridges natural language queries to Boolean patent syntax.
- Gemini (Financial Analysis): A million-plus-token window holds a complete 10-K alongside market reports. Native multimodal capability parses pipeline charts and cap tables from images.
The experiment
To test whether specialized agents could deliver on this, I built a three-agent system, Sonny, and ran it against a project I had worked on previously, where ground truth existed for validation.
I scoped the first version to three agents: Target Biology, Clinical, Patent Expert. Only after validating this core loop did I add Regulatory, Market, and Financial agents.
Success criteria
- Citation protocol: Is every claim traceable?
- Factual accuracy: Are patent numbers and PubMed links real?
- Reasoning quality: Does the analysis reflect domain expert thinking?
I accepted 30-60 second response times for chain-of-thought reasoning and self-correction loops.
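The factual-accuracy criterion admits a cheap automated first pass before any manual review: a citation that fails basic format checks cannot possibly resolve. A minimal sketch (the regexes and function name are my own illustrations, not part of the system):

```python
import re

# Hypothetical format checks: a citation that fails these can't possibly
# resolve, so it is flagged before anyone spends time clicking links.
PMID_RE = re.compile(r"^PMID:\s*\d{1,8}$")                  # PubMed IDs are numeric
US_PATENT_RE = re.compile(r"^US\s?\d{1,2},?\d{3},?\d{3}$")  # e.g. US 10,123,456

def looks_valid(citation: str) -> bool:
    """Return True if the citation at least matches a known ID format."""
    c = citation.strip()
    return bool(PMID_RE.match(c) or US_PATENT_RE.match(c))
```

This only screens for malformed identifiers; confirming that a well-formed PMID points at the paper actually cited still requires resolving the link.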
The results
I ran three full queries, each validated against prior work, with manual citation review, paper verification, and numerical spot-checks.
What worked
Telling agents what not to do reduced fabricated citations more than elaborating desired outputs. Adding explicit acceptance criteria (e.g., "Do not cite a source unless the link resolves") dropped error rates further.
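In practice this meant each agent prompt carried explicit prohibitions alongside the task description. A sketch of what such a prompt fragment might look like (the wording and helper are illustrative, not the production prompt):

```python
# Illustrative prompt construction: prohibitions listed explicitly,
# reflecting the finding that "what not to do" beat elaborating outputs.
PROHIBITIONS = [
    "Do not cite a source unless the link resolves.",
    "Do not report a number that does not appear verbatim in the provided context.",
    "If a claim cannot be supported, state UNKNOWN instead of guessing.",
]

def build_system_prompt(domain: str) -> str:
    rules = "\n".join(f"- {p}" for p in PROHIBITIONS)
    return f"You are a {domain} analyst.\nHard constraints:\n{rules}"
```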
What failed
Analysis stayed surface-level. Adequate for first-pass summary, not for paid analyst work. On one target, Sonny missed a novel mechanism of action generating significant interest from major players, which would have changed the competitive assessment. Models lack training on recent research, which limits utility for competitive intelligence.
The original design synthesized findings into recommendations ("proceed with the $2.2B acquisition"). I removed the recommendation function: only the acquiring company knows its internal metrics and strategic priorities. The system surfaces information; decision-making stays with the team.
The pivot: from advisor to analyst
The system's role shifted from advisor to intelligence analyst.
Before: Sonny resolved ambiguity. Clinical Agent says "Promising," Patent Agent says "Crowded," Sonny weighs and recommends.
After: Sonny exposes ambiguity. "Financial Agent projects 2026 launch. Regulatory Agent identifies 12-month delay risk." The system flags the gap rather than resolving it.
I redesigned the orchestration logic based on early results showing that forced consensus often produced hallucinated compromises.
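The post-pivot orchestration can be sketched as a comparison step that surfaces disagreement instead of merging it away. The field names and agent outputs below are illustrative, not the system's actual data model:

```python
# Sketch of the post-pivot orchestration: instead of merging agent
# outputs into one answer, disagreements are surfaced as flags.
def flag_conflicts(findings: dict[str, dict[str, str]]) -> list[str]:
    """findings maps topic -> {agent_name: claim}; returns conflict flags."""
    flags = []
    for topic, claims in findings.items():
        if len(set(claims.values())) > 1:  # agents disagree on this topic
            detail = "; ".join(f"{a}: {c}" for a, c in claims.items())
            flags.append(f"CONFLICT on {topic} -> {detail}")
    return flags
```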
Architectural changes
Layout-aware parsing
I implemented layout-aware parsing after text extraction destroyed table structure. Converting PDFs to Markdown preserves headers and row alignment.
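The target representation can be illustrated with a small serializer: rather than flattening a table into a text string, rows are kept aligned as Markdown. This is only a sketch of the output format; the real system used a PDF-to-Markdown converter, not this function:

```python
# Minimal illustration of the parsing change: rows are serialized as
# Markdown so header/cell alignment survives into the agent's context.
def table_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```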
Hybrid search
I added hybrid search. Semantic search finds "PD-1 inhibitors." Keyword search forces exact matches for "US Patent 10,123,456."
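The split can be sketched as a routing rule: queries containing exact identifiers go to keyword matching, everything else to semantic retrieval. The semantic side is stubbed out here; only the routing logic is shown, and the regex is my own illustration:

```python
import re

# Sketch of the hybrid retrieval split. Identifiers like patent numbers
# must match exactly; everything else goes through semantic retrieval.
EXACT_ID_RE = re.compile(r"(US\s?Patent\s?[\d,]+|PMID:?\s?\d+)", re.IGNORECASE)

def route_query(query: str) -> str:
    """Return 'keyword' for exact-identifier queries, else 'semantic'."""
    return "keyword" if EXACT_ID_RE.search(query) else "semantic"
```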
Critic agent
I added a critic agent that runs after each response with a verification prompt: "Does every claim have a citation from provided context? If not, delete." This catches unsupported claims before output.
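A deterministic analogue of that critic pass makes the rule concrete: every sentence must carry a citation tag that maps back to the provided context, or it is deleted. The real critic is an LLM prompt; the `[S1]`-style tag format here is illustrative:

```python
import re

# Deterministic analogue of the critic pass: sentences without a valid
# citation tag from the provided context are dropped before output.
CITE_RE = re.compile(r"\[(S\d+)\]")

def critic_filter(answer: str, context_ids: set[str]) -> str:
    kept = []
    for sentence in answer.split(". "):
        tags = CITE_RE.findall(sentence)
        if tags and all(t in context_ids for t in tags):
            kept.append(sentence)
    return ". ".join(kept)
```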
Tone normalization
I built a tone normalization step that strips value-laden adjectives ("impressive," "disappointing"). It rewrites prose into flat observations. Every sentence preserves source metadata.
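In its simplest form the pass is a word-level filter. A crude sketch (the word list is illustrative; the actual step also rewrites sentence structure, which a regex alone cannot do):

```python
import re

# Sketch of the tone pass: value-laden adjectives are stripped so agent
# prose reads as flat observation.
LOADED = re.compile(r"\b(impressive|disappointing|remarkable|promising)\s*",
                    re.IGNORECASE)

def normalize_tone(sentence: str) -> str:
    return LOADED.sub("", sentence).strip()
```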
Architecture (roadmap)
The diagram below maps the target architecture for Sonny 2.0. Solid outlines mark capabilities that are live today; dashed outlines mark planned enhancements including complexity-aware routing, persistent memory, quality gates, and self-improving feedback loops.
Sonny 2.0 Architecture (Roadmap): Multi-Agent Orchestration · Biotech Due Diligence
Key Design Principles
Complexity-aware routing minimizes latency for simple queries. Quality gates with citation verification enforce zero-hallucination output. Feedback loops with max retry bounds default to UNKNOWN rather than fabricate. Persistent memory enables workspace-level continuity across sessions.
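The bounded-retry principle can be sketched in a few lines: if no candidate answer passes verification within the retry budget, the system emits UNKNOWN rather than fabricate. `generate` and `verify` are stand-ins for the agent call and the quality gate:

```python
from typing import Callable

# Sketch of the feedback-loop principle: bounded retries, then UNKNOWN.
def answer_with_retries(generate: Callable[[], str],
                        verify: Callable[[str], bool],
                        max_retries: int = 3) -> str:
    for _ in range(max_retries):
        candidate = generate()
        if verify(candidate):
            return candidate
    return "UNKNOWN"  # default to honesty rather than fabrication
```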
What I learned & open questions
Layout-aware parsing should have been implemented from the start
Treating PDFs as text strings flattened table structure, which caused most of the early numerical errors.
Design for ambiguity earlier
My initial goal was clean, unified answers. In practice, forcing consensus often produced hallucinated compromises. Flagging contradictions proved more useful than resolving them.
Source hierarchy matters in retrieval
Early versions cited press releases over peer-reviewed studies when the press release parsed more easily. I added weighting so PMID-verified papers rank above news snippets.
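The re-weighting amounts to scaling retrieval similarity by a source-type prior, so a cleanly parsed press release can no longer outrank a peer-reviewed paper. The weights below are illustrative, not the production values:

```python
# Sketch of the retrieval re-weighting: peer-reviewed sources outrank
# press releases regardless of how cleanly each parsed.
SOURCE_WEIGHTS = {"peer_reviewed": 1.0, "regulatory_filing": 0.8,
                  "press_release": 0.3}

def rank_sources(hits: list[dict]) -> list[dict]:
    """Each hit has 'score' (retrieval similarity) and 'source_type'."""
    return sorted(
        hits,
        key=lambda h: h["score"] * SOURCE_WEIGHTS.get(h["source_type"], 0.5),
        reverse=True,
    )
```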
Most errors traced back to input data, not model reasoning
Improving document parsing and source selection had more impact than prompt refinement.
Can an LLM detect what's missing?
In drug development, omissions often matter more than disclosures. A company may present 4-week efficacy data without mentioning the 8-week results, or highlight tumor response while omitting liver enzyme elevations. The signal often lives in what's missing.
Disclosure requirements vary by stage. Clinical trials mandate comprehensive reporting of adverse events. Preclinical programs don't. A target can quietly disappear from a pipeline due to toxicity, strategic reprioritization, or simple resource constraints, and no explanation is required. The data package won't tell you which.
Most models are built to summarize what's present. They excel at compressing evidence, but struggle to recognize silence. I'm not yet convinced an agent can reliably flag suspicious omissions without the kind of intuition that comes from watching programs advance, stall, and fail over years. That intuition may be the hardest thing to encode, and the most valuable thing to retain.
Limitations & future considerations
Model evolution
These model selection choices reflect the landscape as of late 2025. LLMs will only improve, and the tradeoffs will shift.
Notably, improvement doesn't uniformly reduce risk. A smarter model can produce more dangerous hallucinations because they become reasoned hallucinations: the model can explain why it generated a fabricated figure, making the error sound incredibly convincing.
This raises the bar for verification rather than lowering it. The architecture must account for the possibility that the tools get sharper but the failure modes get subtler. This is what I'm watching most closely as the models evolve.