
Multi-Agent Intelligence System

Sonny

Applying the scientific method to product development.

The context

At Generate Biomedicines, I was part of a cross-functional team evaluating targets for an ADC partnership. My team handled target biology: mechanism of action, expression patterns, scientific feasibility. While we understood the clinical, commercial, and regulatory landscapes at a high level, we relied on specialized teams to provide the depth of analysis these domains required.

Each group worked in parallel for weeks before reconvening. Scheduling was difficult, and when we finally met, a significant portion of the time went to walking colleagues through target biology concepts, explaining the mechanism of action and why it mattered for the partnership. We often ran out of time before reaching decisions.

The question

Can domain-specialized agents, working in concert, produce analysis trustworthy enough that teams would use it to propose or kill targets within days rather than weeks?

The goal is not to replace expertise. It is to provide synthesized biology, competitive intelligence, financial context, and regulatory landscape so meetings become decision-making sessions rather than information-sharing sessions.

Every figure, claim, and source must be traceable and correct before any downstream step acts on it.

The hypothesis

Precision is a function of specialization, not scale.

A single agent covering biology, clinical, financial, regulatory, and IP analysis would stretch its context window thin and lose depth. If each agent focused on one domain with tailored prompts, relevant context, and appropriate model selection, outputs would be more reliable and verifiable.

The body doesn't deploy one universal immune cell against every pathogen. It coordinates dozens of specialized immune cell types, each recognizing and eliminating specific antigens through distinct mechanisms. Likewise, a generalist LLM trained on the entire internet must filter through massive noise to find signal; a specialist agent starts from a curated, domain-specific context.

But specialization alone isn't enough. Every claim needs a verifiable source. I designed for an audit trail that allows verification within 30 seconds.

This led to deliberate model selection

Claude

for Biology, Clinical, Regulatory

Constitutional AI training produces strict formatting compliance and conservative interpretation. 200K token context window ingests full study protocols.

Perplexity

for Patent, Market Research

Reasoning engine over live web and patent databases. Bridges natural language queries to Boolean patent syntax.

Gemini

for Financial Analysis

Million-plus token window holds a complete 10-K with market reports. Native multimodal capability parses pipeline charts and cap tables from images.

The experiment

To test whether specialized agents could deliver on this, I built a three-agent system and ran it against a project I had worked on previously, where ground truth existed for validation.

I scoped the first version to three agents: Target Biology, Clinical, and Patent Expert. Only after validating this core loop did I add Regulatory, Market, and Financial agents.

Success criteria

  • Citation protocol: Is every claim traceable?
  • Factual accuracy: Are patent numbers and PubMed links real?
  • Reasoning quality: Does the analysis reflect domain expert thinking?

I accepted 30-60 second response times for chain-of-thought reasoning and self-correction loops.

The results

Three full queries, each validated against prior work. Manual citation review, paper verification, numerical spot-checks.

What worked

Telling agents what not to do reduced fabricated citations more than elaborating desired outputs. Adding explicit acceptance criteria (e.g., "Do not cite a source unless the link resolves") dropped error rates further.
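The acceptance criterion above ("do not cite a source unless the link resolves") can be made mechanical. A minimal sketch, assuming citations carry a type, an identifier, and optionally a URL (the real system's citation schema is not shown in this write-up); the link resolver is injected so the check can run offline:

```python
import re

# Hypothetical citation record shapes; illustrative only.
PMID_RE = re.compile(r"^\d{1,8}$")
PATENT_RE = re.compile(r"^US ?Patent ?\d{1,2},?\d{3},?\d{3}$", re.IGNORECASE)

def citation_passes(citation: dict, resolver=None) -> bool:
    """Acceptance check: reject any citation that fails format
    validation or whose link does not resolve."""
    cid = citation.get("id", "")
    if citation.get("type") == "pmid" and not PMID_RE.match(cid):
        return False
    if citation.get("type") == "patent" and not PATENT_RE.match(cid):
        return False
    url = citation.get("url")
    if url and resolver is not None and not resolver(url):
        return False  # "Do not cite a source unless the link resolves"
    return True
```

Claims whose citations fail this gate are dropped before output rather than emitted with a caveat.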

What failed

Analysis stayed surface-level. Adequate for first-pass summary, not for paid analyst work. On one target, Sonny missed a novel mechanism of action generating significant interest from major players, which would have changed the competitive assessment. Models lack training on recent research, which limits utility for competitive intelligence.

The original design synthesized findings into recommendations ("proceed with the $2.2B acquisition"). I removed that function: only the acquiring company knows its internal metrics and strategic priorities. The system surfaces information; decision-making stays with the team.

The pivot: from advisor to analyst

The system's role shifted from advisor to intelligence analyst.

Before: Sonny resolved ambiguity. Clinical Agent says "Promising," Patent Agent says "Crowded," Sonny weighs and recommends.

After: Sonny exposes ambiguity. "Financial Agent projects 2026 launch. Regulatory Agent identifies 12-month delay risk." The system flags the gap rather than resolving it.

I redesigned the orchestration logic based on early results showing that forced consensus often produced hallucinated compromises.
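The "expose, don't resolve" behavior can be sketched as a pairwise comparison over agent findings. This is a toy version: `findings` and the `conflicts` set are illustrative stand-ins for the real orchestration logic, which would compare structured claims rather than string labels:

```python
from itertools import combinations

def flag_contradictions(findings: dict[str, str],
                        conflicts: set[frozenset[str]]) -> list[str]:
    """Surface disagreements between agents instead of forcing consensus.
    `findings` maps agent name -> assessment; `conflicts` lists pairs of
    assessments considered contradictory."""
    flags = []
    for (a1, v1), (a2, v2) in combinations(findings.items(), 2):
        if frozenset((v1, v2)) in conflicts:
            flags.append(f"GAP: {a1} says '{v1}'; {a2} says '{v2}'")
    return flags
```

The flagged gaps go to the team as open questions; nothing in the pipeline tries to average them into a single verdict.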

Architectural changes

Layout-aware parsing

I implemented layout-aware parsing after text extraction destroyed table structure. Converting PDFs to Markdown preserves headers and row alignment.
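The core of the fix is re-emitting extracted tables as Markdown so downstream agents see column alignment instead of a flattened string. A minimal sketch, assuming the parser has already recovered header and row cells:

```python
def table_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Render extracted table cells as a GitHub-style Markdown table,
    preserving the header/row structure that plain text extraction loses."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```

With rows kept intact, a "42% ORR" stays attached to its cohort instead of floating free in a paragraph of digits.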

Hybrid search

I added hybrid search. Semantic search finds "PD-1 inhibitors." Keyword search forces exact matches for "US Patent 10,123,456."
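One way to combine the two modes is to score every document semantically and then hard-boost exact matches on identifiers. A sketch under stated assumptions: `semantic_score` stands in for a real embedding-based scorer, and the identifier detector is deliberately crude:

```python
import re

def hybrid_rank(query: str, docs: list[dict], semantic_score) -> list[dict]:
    """Blend semantic similarity with exact-match boosting so identifiers
    like 'US Patent 10,123,456' always outrank merely similar documents."""
    ident = re.search(r"\d[\d,]*\d", query)  # crude numeric-identifier detector
    ranked = []
    for doc in docs:
        score = semantic_score(query, doc["text"])
        if ident and ident.group(0) in doc["text"]:
            score += 1.0  # exact-match boost dominates semantic score
        ranked.append({**doc, "score": score})
    return sorted(ranked, key=lambda d: d["score"], reverse=True)
```

In production this role is typically played by a BM25 or keyword index fused with a vector index; the boost constant here is arbitrary.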

Critic agent

I added a critic agent that runs after each response with a verification prompt: "Does every claim have a citation from provided context? If not, delete." This catches unsupported claims before output.
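Mechanically, the critic behaves like a sentence-level filter. A minimal sketch, assuming citation markers look like `[PMID:12345678]` (the actual marker format in Sonny is not shown here):

```python
import re

def critic_pass(draft: str, allowed_citations: set[str]) -> str:
    """Keep only sentences whose citations all appear in the provided
    context; delete unsupported claims entirely rather than hedge them."""
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", draft.strip()):
        cites = re.findall(r"\[([^\]]+)\]", sentence)
        if cites and all(c in allowed_citations for c in cites):
            kept.append(sentence)
    return " ".join(kept)
```

In the real system this check is itself an LLM call; the hard-coded filter above just illustrates the contract it enforces.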

Tone normalization

I built a tone normalization step that strips value-laden adjectives ("impressive," "disappointing"). It rewrites prose into flat observations. Every sentence preserves source metadata.
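The stripping step can be sketched as a word-list filter. A real implementation would use POS tagging or an LLM rewrite pass; the adjective list here is illustrative, not the system's actual list:

```python
VALUE_LADEN = {"impressive", "disappointing", "promising",
               "remarkable", "exciting", "troubling"}  # illustrative only

def normalize_tone(sentence: str) -> str:
    """Remove value-laden adjectives so output reads as flat observation."""
    words = [w for w in sentence.split()
             if w.strip(",.").lower() not in VALUE_LADEN]
    return " ".join(words)
```

The payoff is that two agents describing the same figure can no longer disagree in tone while agreeing in fact.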

Architecture (roadmap)

The diagram below maps the target architecture for Sonny 2.0. Solid outlines mark capabilities that are live today; dashed outlines mark planned enhancements including complexity-aware routing, persistent memory, quality gates, and self-improving feedback loops.

[Figure: Sonny 2.0 Architecture (Roadmap). Multi-Agent Orchestration · Biotech Due Diligence. V2.0, 2026.]

The diagram shows six layers:

  • Input / Routing: query + documents with target, persona, and workspace context; a Complexity Analyzer classifies each query and routes execution (simple: 1-2 agents; complex: 4-6+), alongside persona context (Scientist / Scout) and planned persistent workspace memory.
  • Orchestration: the Sonny 2.0 orchestrator's Dynamic Execution Planner selects the required agents and builds the plan.
  • Agent Execution (live): core agents always active (Target Biology, Clinical Analyst, Patent Expert) plus a dynamic on-demand pool (Financial Analyst, Market Research, Regulatory, extensible custom specialists), running with parallel tool calls and an extended-thinking block for cross-agent synthesis.
  • Quality Assurance (planned): a Citation Agent verifies all sources, a Critic Agent enforces epistemic rigor, and an LLM evaluator scores quality and confidence against a threshold.
  • Feedback Loop (planned): below threshold, a Meta-Agent diagnoses the failure and a Prompt Improver refines prompts, with a maximum of two retries before defaulting to UNKNOWN.
  • Output: a verified, cited final report rendered as tiles in the LUMINA dashboard.

Key Design Principles

Complexity-aware routing minimizes latency for simple queries. Quality gates with citation verification enforce zero-hallucination output. Feedback loops with max retry bounds default to UNKNOWN rather than fabricate. Persistent memory enables workspace-level continuity across sessions.

Solid outlines = currently implemented
Dashed outlines = planned for v2.0
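The planned retry bound with an UNKNOWN default can be sketched as a small control loop. `generate` and `evaluate` are stand-ins for the agent call and the planned LLM evaluator; the threshold and retry count are the roadmap's, not tuned values:

```python
def run_with_quality_gate(generate, evaluate, threshold=0.8, max_retries=2):
    """Bounded feedback loop: retry when the evaluator score falls below
    threshold; after max_retries, return 'UNKNOWN' rather than emit a
    low-confidence (possibly fabricated) answer."""
    for attempt in range(max_retries + 1):
        answer = generate(attempt)
        if evaluate(answer) >= threshold:
            return answer
    return "UNKNOWN"
```

The important design choice is the default: an honest UNKNOWN is cheaper to recover from than a confident fabrication.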

What I learned & open questions

Layout-aware parsing should have been implemented from the start

Treating PDFs as text strings flattened table structure, which caused most of the early numerical errors.

Design for ambiguity earlier

My initial goal was clean, unified answers. In practice, forcing consensus often produced hallucinated compromises. Flagging contradictions proved more useful than resolving them.

Source hierarchy matters in retrieval

Early versions cited press releases over peer-reviewed studies when the press release parsed more easily. I added weighting so PMID-verified papers rank above news snippets.
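The weighting can be sketched as a tier multiplier applied at re-rank time. The tier values below are illustrative assumptions, not the system's actual weights:

```python
# Illustrative tier weights; a real ranking would tune these empirically.
SOURCE_TIER = {"peer_reviewed_pmid": 3.0, "regulatory_filing": 2.0,
               "press_release": 1.0, "unknown": 0.5}

def rank_sources(hits: list[dict]) -> list[dict]:
    """Re-rank retrieval hits so PMID-verified papers outrank press
    releases even when the press release had the higher raw score."""
    return sorted(hits,
                  key=lambda h: h["score"] * SOURCE_TIER.get(h["type"], 0.5),
                  reverse=True)
```

Parseability no longer decides precedence; provenance does.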

Most errors traced back to input data, not model reasoning

Improving document parsing and source selection had more impact than prompt refinement.

Can an LLM detect what's missing?

In drug development, omissions often matter more than disclosures. A company may present 4-week efficacy data without mentioning the 8-week results, or highlight tumor response while omitting liver enzyme elevations. The signal often lives in what's missing.

Disclosure requirements vary by stage. Clinical trials mandate comprehensive reporting of adverse events. Preclinical programs don't. A target can quietly disappear from a pipeline due to toxicity, strategic reprioritization, or simple resource constraints, and no explanation is required. The data package won't tell you which.

Most models are built to summarize what's present. They excel at compressing evidence, but struggle to recognize silence. I'm not yet convinced an agent can reliably flag suspicious omissions without the kind of intuition that comes from watching programs advance, stall, and fail over years. That intuition may be the hardest thing to encode, and the most valuable thing to retain.

Limitations & future considerations

Model evolution

These model selection choices reflect the landscape as of late 2025. LLMs will only improve, and the tradeoffs will shift.

Notably, improvement doesn't uniformly reduce risk. A smarter model can produce more dangerous hallucinations because they become reasoned hallucinations: the model can explain why it generated a fabricated figure, which makes the error sound far more convincing.

This raises the bar for verification rather than lowering it. The architecture must account for the possibility that the tools get sharper while the failure modes get subtler. This is what I'm watching most closely as the models evolve.