The context
At Generate Biomedicines, I was part of a cross-functional team evaluating targets for an ADC partnership. My team handled target biology: mechanism of action, expression patterns, scientific feasibility. While we understood the clinical, commercial, and regulatory landscapes at a high level, we relied on specialized teams to provide the depth of analysis these domains required.
Each group worked in parallel for weeks before reconvening. Scheduling was difficult, and when we finally met, a significant portion of the time went to walking colleagues through target biology concepts, explaining the mechanism of action and why it mattered for the partnership. We often ran out of time before reaching decisions.
The question
Can domain-specialized agents, working in concert, produce analysis trustworthy enough that teams would use it to propose or kill targets within days rather than weeks?
The goal is not to replace expertise. It is to provide synthesized biology, competitive intelligence, financial context, and regulatory landscape so meetings become decision-making sessions rather than information-sharing sessions.
Every figure, claim, and source must be traceable and correct before any function will act on it.
The hypothesis
Precision is a function of specialization, not scale.
A single agent covering biology, clinical, financial, regulatory, and IP analysis would stretch its context window thin and lose depth. If each agent focused on one domain with tailored prompts, relevant context, and appropriate model selection, outputs would be more reliable and verifiable.
The body doesn't deploy one universal immune cell against every pathogen. Rather, it coordinates dozens of specialized immune cell types, each recognizing and eliminating specific antigens through distinct mechanisms. A generalist LLM trained on the entire internet filters through massive noise to find signal.
But specialization alone isn't enough. Every claim needs a verifiable source. I designed for an audit trail that allows verification within 30 seconds.
This led to deliberate model selection:
- Claude (Biology, Clinical, Regulatory): Constitutional AI training produces strict formatting compliance and conservative interpretation. The 200K-token context window ingests full study protocols.
- Perplexity (Patent, Market Research): A reasoning engine over live web and patent databases. Bridges natural language queries to Boolean patent syntax.
- Gemini (Financial Analysis): A million-plus-token window holds a complete 10-K alongside market reports. Native multimodal capability parses pipeline charts and cap tables from images.
The experiment
To test whether specialized agents could deliver on this, I built a three-agent system, Sonny, and ran it against a project I had worked on previously, where ground truth existed for validation.
I scoped the first version to three agents: Target Biology, Clinical, Patent Expert. Only after validating this core loop did I add Regulatory, Market, and Financial agents.
Success criteria
- Citation protocol: Is every claim traceable?
- Factual accuracy: Are patent numbers and PubMed links real?
- Reasoning quality: Does the analysis reflect domain expert thinking?
I accepted 30-60 second response times for chain-of-thought reasoning and self-correction loops.
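The factual-accuracy criterion admits a cheap automated first pass before any manual review: a citation that fails basic format checks cannot possibly resolve. A minimal sketch (the regexes and function name are my own illustrations, not part of the system):

```python
import re

# Hypothetical format checks: a citation that fails these can't possibly
# resolve, so it is flagged before anyone spends time clicking links.
PMID_RE = re.compile(r"^PMID:\s*\d{1,8}$")                  # PubMed IDs are numeric
US_PATENT_RE = re.compile(r"^US\s?\d{1,2},?\d{3},?\d{3}$")  # e.g. US 10,123,456

def looks_valid(citation: str) -> bool:
    """Return True if the citation at least matches a known ID format."""
    c = citation.strip()
    return bool(PMID_RE.match(c) or US_PATENT_RE.match(c))
```

This only screens for malformed identifiers; confirming that a well-formed PMID points at the paper actually cited still requires resolving the link.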
The results
I ran three full queries, each validated against prior work, with manual citation review, paper verification, and numerical spot-checks.
What worked
Telling agents what not to do reduced fabricated citations more than elaborating desired outputs. Adding explicit acceptance criteria (e.g., "Do not cite a source unless the link resolves") dropped error rates further.
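In practice this meant each agent prompt carried explicit prohibitions alongside the task description. A sketch of what such a prompt fragment might look like (the wording and helper are illustrative, not the production prompt):

```python
# Illustrative prompt construction: prohibitions listed explicitly,
# reflecting the finding that "what not to do" beat elaborating outputs.
PROHIBITIONS = [
    "Do not cite a source unless the link resolves.",
    "Do not report a number that does not appear verbatim in the provided context.",
    "If a claim cannot be supported, state UNKNOWN instead of guessing.",
]

def build_system_prompt(domain: str) -> str:
    rules = "\n".join(f"- {p}" for p in PROHIBITIONS)
    return f"You are a {domain} analyst.\nHard constraints:\n{rules}"
```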
What failed
Analysis stayed surface-level. Adequate for first-pass summary, not for paid analyst work. On one target, Sonny missed a novel mechanism of action generating significant interest from major players, which would have changed the competitive assessment. Models lack training on recent research, which limits utility for competitive intelligence.
The original design synthesized findings into recommendations ("proceed with the $2.2B acquisition"). I removed the recommendation function: only the acquiring company knows its internal metrics and strategic priorities. The system surfaces information; decision-making stays with the team.
The pivot: from advisor to analyst
The system's role shifted from advisor to intelligence analyst.
Before: Sonny resolved ambiguity. Clinical Agent says "Promising," Patent Agent says "Crowded," Sonny weighs and recommends.
After: Sonny exposes ambiguity. "Financial Agent projects 2026 launch. Regulatory Agent identifies 12-month delay risk." The system flags the gap rather than resolving it.
I redesigned the orchestration logic based on early results showing that forced consensus often produced hallucinated compromises.
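The post-pivot orchestration can be sketched as a comparison step that surfaces disagreement instead of merging it away. The field names and agent outputs below are illustrative, not the system's actual data model:

```python
# Sketch of the post-pivot orchestration: instead of merging agent
# outputs into one answer, disagreements are surfaced as flags.
def flag_conflicts(findings: dict[str, dict[str, str]]) -> list[str]:
    """findings maps topic -> {agent_name: claim}; returns conflict flags."""
    flags = []
    for topic, claims in findings.items():
        if len(set(claims.values())) > 1:  # agents disagree on this topic
            detail = "; ".join(f"{a}: {c}" for a, c in claims.items())
            flags.append(f"CONFLICT on {topic} -> {detail}")
    return flags
```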
Architectural changes
Layout-aware parsing
I implemented layout-aware parsing after text extraction destroyed table structure. Converting PDFs to Markdown preserves headers and row alignment.
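The target representation can be illustrated with a small serializer: rather than flattening a table into a text string, rows are kept aligned as Markdown. This is only a sketch of the output format; the real system used a PDF-to-Markdown converter, not this function:

```python
# Minimal illustration of the parsing change: rows are serialized as
# Markdown so header/cell alignment survives into the agent's context.
def table_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```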
Hybrid search
I added hybrid search. Semantic search finds "PD-1 inhibitors." Keyword search forces exact matches for "US Patent 10,123,456."
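The split can be sketched as a routing rule: queries containing exact identifiers go to keyword matching, everything else to semantic retrieval. The semantic side is stubbed out here; only the routing logic is shown, and the regex is my own illustration:

```python
import re

# Sketch of the hybrid retrieval split. Identifiers like patent numbers
# must match exactly; everything else goes through semantic retrieval.
EXACT_ID_RE = re.compile(r"(US\s?Patent\s?[\d,]+|PMID:?\s?\d+)", re.IGNORECASE)

def route_query(query: str) -> str:
    """Return 'keyword' for exact-identifier queries, else 'semantic'."""
    return "keyword" if EXACT_ID_RE.search(query) else "semantic"
```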
Critic agent
I added a critic agent that runs after each response with a verification prompt: "Does every claim have a citation from provided context? If not, delete." This catches unsupported claims before output.
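A deterministic analogue of that critic pass makes the rule concrete: every sentence must carry a citation tag that maps back to the provided context, or it is deleted. The real critic is an LLM prompt; the `[S1]`-style tag format here is illustrative:

```python
import re

# Deterministic analogue of the critic pass: sentences without a valid
# citation tag from the provided context are dropped before output.
CITE_RE = re.compile(r"\[(S\d+)\]")

def critic_filter(answer: str, context_ids: set[str]) -> str:
    kept = []
    for sentence in answer.split(". "):
        tags = CITE_RE.findall(sentence)
        if tags and all(t in context_ids for t in tags):
            kept.append(sentence)
    return ". ".join(kept)
```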
Tone normalization
I built a tone normalization step that strips value-laden adjectives ("impressive," "disappointing"). It rewrites prose into flat observations. Every sentence preserves source metadata.
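In its simplest form the pass is a word-level filter. A crude sketch (the word list is illustrative; the actual step also rewrites sentence structure, which a regex alone cannot do):

```python
import re

# Sketch of the tone pass: value-laden adjectives are stripped so agent
# prose reads as flat observation.
LOADED = re.compile(r"\b(impressive|disappointing|remarkable|promising)\s*",
                    re.IGNORECASE)

def normalize_tone(sentence: str) -> str:
    return LOADED.sub("", sentence).strip()
```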
Architecture (roadmap)
The diagram below maps the target architecture for Sonny 2.0. Solid outlines mark capabilities that are live today; dashed outlines mark planned enhancements including complexity-aware routing, persistent memory, quality gates, and self-improving feedback loops.
Sonny 2.0 Architecture (Roadmap): Multi-Agent Orchestration · Biotech Due Diligence
Key Design Principles
Complexity-aware routing minimizes latency for simple queries. Quality gates with citation verification enforce zero-hallucination output. Feedback loops with max retry bounds default to UNKNOWN rather than fabricate. Persistent memory enables workspace-level continuity across sessions.
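The bounded-retry principle can be sketched in a few lines: if no candidate answer passes verification within the retry budget, the system emits UNKNOWN rather than fabricate. `generate` and `verify` are stand-ins for the agent call and the quality gate:

```python
from typing import Callable

# Sketch of the feedback-loop principle: bounded retries, then UNKNOWN.
def answer_with_retries(generate: Callable[[], str],
                        verify: Callable[[str], bool],
                        max_retries: int = 3) -> str:
    for _ in range(max_retries):
        candidate = generate()
        if verify(candidate):
            return candidate
    return "UNKNOWN"  # default to honesty rather than fabrication
```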
What I learned & open questions
Layout-aware parsing should have been implemented from the start
Treating PDFs as text strings flattened table structure, which caused most of the early numerical errors.
Design for ambiguity earlier
My initial goal was clean, unified answers. In practice, forcing consensus often produced hallucinated compromises. Flagging contradictions proved more useful than resolving them.
Source hierarchy matters in retrieval
Early versions cited press releases over peer-reviewed studies when the press release parsed more easily. I added weighting so PMID-verified papers rank above news snippets.
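The re-weighting amounts to scaling retrieval similarity by a source-type prior, so a cleanly parsed press release can no longer outrank a peer-reviewed paper. The weights below are illustrative, not the production values:

```python
# Sketch of the retrieval re-weighting: peer-reviewed sources outrank
# press releases regardless of how cleanly each parsed.
SOURCE_WEIGHTS = {"peer_reviewed": 1.0, "regulatory_filing": 0.8,
                  "press_release": 0.3}

def rank_sources(hits: list[dict]) -> list[dict]:
    """Each hit has 'score' (retrieval similarity) and 'source_type'."""
    return sorted(
        hits,
        key=lambda h: h["score"] * SOURCE_WEIGHTS.get(h["source_type"], 0.5),
        reverse=True,
    )
```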
Most errors traced back to input data, not model reasoning
Improving document parsing and source selection had more impact than prompt refinement.
Can an LLM detect what's missing?
In drug development, omissions often matter more than disclosures. A company may present 4-week efficacy data without mentioning the 8-week results, or highlight tumor response while omitting liver enzyme elevations. The signal often lives in what's missing.
Disclosure requirements vary by stage. Clinical trials mandate comprehensive reporting of adverse events. Preclinical programs don't. A target can quietly disappear from a pipeline due to toxicity, strategic reprioritization, or simple resource constraints, and no explanation is required. The data package won't tell you which.
Most models are built to summarize what's present. They excel at compressing evidence, but struggle to recognize silence. I'm not yet convinced an agent can reliably flag suspicious omissions without the kind of intuition that comes from watching programs advance, stall, and fail over years. That intuition may be the hardest thing to encode, and the most valuable thing to retain.
Limitations & future considerations
Model evolution
These model selection choices reflect the landscape as of late 2025. LLMs will only improve, and the tradeoffs will shift.
Notably, improvement doesn't uniformly reduce risk. A smarter model can produce more dangerous hallucinations because they become reasoned hallucinations: the model can explain why it generated a fabricated figure, making the error sound incredibly convincing.
This raises the bar for verification rather than lowering it. The architecture must account for the possibility that the tools get sharper but the failure modes get subtler. This is what I'm watching most closely as the models evolve.