Why Drug Development Needs Better Data
Drug development has always been a costly, time-intensive process. The path from target identification to a clinical candidate still takes over five years in most cases. Nearly 90% of drugs that enter clinical trials ultimately fail. Moreover, R&D costs per approved therapy continue to double every nine years, driven by the growing complexity of treatment modalities and the sheer volume of hypotheses that teams must evaluate.
Fortunately, AI is beginning to change these odds. Between 2012 and 2022, approximately 200 companies that leveraged AI for drug discovery collectively raised $18 billion. Today, we are seeing those investments bear fruit in the clinic.
In June 2025, Insilico Medicine published positive Phase IIa results for rentosertib — the first drug where both target discovery and molecule design relied entirely on generative AI. Their team nominated a preclinical candidate after screening just 78 molecules, compared to the thousands typically required. They achieved this in 18 months and at less than 10% of the average cost per approved drug. Consequently, large pharma has taken notice: GSK committed $50 million upfront to NOETIK, while Eli Lilly agreed to pay a mid-eight-figure annual fee to Chai Discovery for biologics design access.
By 2024, over 350 biological AI models had been published — including AlphaFold3, ESM3, Boltz-1, and scGPT — reflecting AI’s expanding reach across protein design, genomics, and pathology analysis. The Cambrian explosion of biology AI models has already arrived. Therefore, the competitive edge in drug development will no longer come from access to models alone. Instead, it will come from the data infrastructure that feeds them.
Three Core Principles of Biology-Native Infrastructure
Principle 1: Biology-Native Data at Scale
Much of the training data behind today’s AI biology models was assembled over decades of publicly funded science. The Protein Data Bank (PDB) contains over 200,000 protein structures. The Human Genome Project mapped the full human genome. ChEMBL compiled bioactivity data on millions of small molecules. Structural data from the PDB underpinned every protein-targeted small-molecule cancer drug the FDA approved between 2019 and 2023.
However, these databases have significant limitations. The PDB skews toward proteins that are stable and easy to crystallize. Meanwhile, membrane proteins and intrinsically disordered proteins — among the most compelling oncology targets — remain severely underrepresented. Furthermore, PDB structures are static snapshots. They freeze proteins in a single conformation rather than the dynamic shapes they adopt inside a living cell.
Beyond structural gaps, more than two-thirds of drug development time goes to steps after early discovery. ADME studies, formulation optimization, immunogenicity testing, and clinical safety evaluations all demand data for which large, high-quality public datasets simply do not yet exist. Additionally, patient-level omics profiles linked to treatment outcomes remain siloed across hospital systems and biopharma databases. As a result, training models to predict which patients will respond to a therapy before trial enrollment remains largely out of reach.
To unlock AI’s true potential in drug development, companies must invest in two areas: generating novel multi-modal biological measurements and building datasets with the scale, consistency, and context that modern AI models require.
Companies leading here include:
- Peptone — combining biophysics with supercomputing to generate data on intrinsically disordered proteins
- Inductive Bio — assembling one of the industry’s largest ADMET datasets
- NOETIK — pairing tumor multi-omics with longitudinal treatment outcomes in oncology
- Prima Mente — building whole-genome epigenetic models applied to brain disease
Principle 2: Agentic AI Across R&D Workflows
While drug development costs have risen steadily, computing costs have fallen exponentially since the 1950s. Tasks that are computationally expensive today will cost a fraction as much within a few years. Thus, companies that build flexible, modular AI infrastructure from day one will hold a decisive structural advantage over those anchored to a fixed stack.
A decade ago, building proprietary molecular modeling tools in-house was a genuine differentiator. Today, structure predictors, ADMET models, and molecular dynamics simulators are widely accessible through both open-source and commercial platforms. Therefore, the smarter approach now is to strategically combine the best available tools rather than build everything from scratch.
Agentic AI makes this combinatorial approach scalable. AI agents can mine preprint servers, patent filings, and public biological databases simultaneously. They can surface non-obvious connections, generate novel hypotheses, design wet lab experiments, and write research reports — all while maintaining team-wide research context and a full experimental record.
Cheaper compute has also made long-context inference economically practical. A single AI agent run can now synthesize over 1,000 papers and 40,000 lines of code. Combined with chain-of-thought reasoning and multi-agent frameworks, this capability meaningfully compresses R&D timelines.
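The shared-context pattern described above can be sketched in miniature. The snippet below is a hypothetical toy, not any vendor’s implementation: the corpora, keyword, and entries are invented stand-ins for preprint servers, patent filings, and public databases, and each “agent” is reduced to a keyword scan that logs hits to an append-only team context before a synthesis step drafts a report.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchContext:
    """Shared, append-only record so every agent sees the full team context."""
    findings: list = field(default_factory=list)

    def log(self, source: str, note: str) -> None:
        self.findings.append((source, note))

# Hypothetical mini-corpora standing in for preprints, patents, and databases.
CORPORA = {
    "preprints": ["KRAS G12C inhibitor shows synergy with SHP2 blockade"],
    "patents":   ["Claim: covalent KRAS G12C binder, quinazoline scaffold"],
    "databases": ["ChEMBL: SHP2 allosteric inhibitors, IC50 < 100 nM"],
}

def mining_agent(source: str, keyword: str, ctx: ResearchContext) -> None:
    """One agent per source: scan entries and log keyword hits to shared context."""
    for entry in CORPORA[source]:
        if keyword.lower() in entry.lower():
            ctx.log(source, entry)

def report(ctx: ResearchContext) -> str:
    """Synthesis step: connect hits across sources into a draft summary."""
    sources = sorted({s for s, _ in ctx.findings})
    return f"Cross-source signal ({', '.join(sources)}): {len(ctx.findings)} supporting items"

ctx = ResearchContext()
for src in CORPORA:
    mining_agent(src, "SHP2", ctx)
print(report(ctx))
```

Real agentic platforms replace the keyword scan with LLM calls and the report with chain-of-thought synthesis, but the structural idea is the same: independent workers over heterogeneous sources, one persistent research record.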
Companies building agentic R&D infrastructure include:
- K-Dense and Edison Scientific — autonomous AI scientist platforms that execute long-horizon research workflows
- Phylo — an integrated biology environment for seamless scientist-AI collaboration
- Potato and Convoke — operating systems for biopharma spanning early discovery through regulatory commercialization workflows
Principle 3: Closed-Loop Lab Automation
Even the most advanced AI models depend on experimental data to validate their outputs. Binding affinity predictions, in vivo efficacy, pharmacokinetics, and toxicity profiles all require wet lab confirmation before any development decision can proceed confidently. Yet the experimental cycles connecting a model’s output to its next update often take weeks or months.
The iterative design-test-make-analyze loop in lead optimization alone can take up to three years. This bottleneck is further stretched when validation work gets outsourced to Contract Research Organizations (CROs), adding coordination overhead, queue times, and data quality inconsistencies to each cycle.
Historically, lab automation tools like Hamilton liquid handlers and Chemspeed synthesis platforms were optimized for specific high-throughput tasks, not end-to-end workflow integration. Most labs still require significant human intervention to transfer materials between instruments and interpret results. Natural language interfaces for robot control are beginning to change this, letting scientists without robotics backgrounds run, monitor, and iterate on experiments remotely. Vision-native systems can now autonomously read microscopy images of cells and feed structured data directly back into model pipelines.
Companies that close the loop between computational prediction and experimental feedback will compound their biological learning far faster than competitors relying on traditional CRO timelines.
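The design-test-make-analyze cycle can be sketched as a simple optimization loop. Everything below is an illustrative toy under stated assumptions: `simulated_assay` is an invented stand-in for a wet-lab measurement with a hidden optimum, and the loop narrows its design space around the best result from each batch, the way a closed-loop platform refocuses the next round of candidates on what the last round learned.

```python
import random

def simulated_assay(x: float) -> float:
    """Toy stand-in for a wet-lab potency readout (hypothetical).
    Hidden optimum at x = 0.7; Gaussian noise models assay variability."""
    return -(x - 0.7) ** 2 + random.gauss(0, 0.01)

def dmta_loop(cycles: int = 5, batch: int = 8, seed: int = 0) -> float:
    """Minimal design-make-test-analyze loop: each cycle designs a batch of
    candidates around the current best, 'tests' them with the simulated
    assay, and analyzes results to refocus the next design round."""
    random.seed(seed)
    best_x, best_score = 0.0, float("-inf")
    width = 1.0  # design-space width shrinks as the loop learns
    for _ in range(cycles):
        # Design: propose candidates near the current best, clipped to [0, 1]
        candidates = [min(1.0, max(0.0, best_x + random.uniform(-width, width)))
                      for _ in range(batch)]
        # Make + Test: run the (simulated) assay on each candidate
        results = [(x, simulated_assay(x)) for x in candidates]
        # Analyze: keep the best measurement, tighten the next search
        x, score = max(results, key=lambda r: r[1])
        if score > best_score:
            best_x, best_score = x, score
        width *= 0.5
    return best_x

print(dmta_loop())  # converges toward the hidden optimum near 0.7
```

The point of the sketch is the cadence, not the algorithm: when each cycle takes minutes of robot time instead of weeks of CRO turnaround, the same number of learning iterations compresses from years into days.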
Companies leading closed-loop automation include:
- Medra — an instrument-agnostic robotics platform for general-purpose lab interaction
- Automata — modular hardware and software connecting instruments into end-to-end automated workflows
- Dash Bio — a faster, more automated CRO model
- Lila Sciences — a fully automated lab for end-to-end drug discovery and development
Life Sciences Will Run on AI
The three principles of biology-native data infrastructure are not independent. They reinforce each other. Better data trains better models. Better models guide faster experiments. Faster experiments generate better data. This flywheel defines how the next generation of leading life science companies will be built.
Companies that generate biology-native data at scale, deploy agentic AI across their full R&D stack, and adopt closed-loop lab automation will ultimately compress drug development timelines, reduce clinical failure rates, and deliver on AI’s long-standing promise for human health.
