Background
In this post, I’m continuing a series focused on establishing base rates for data availability in key biological domains. Previously, I examined the volume and growth rate of scRNA-seq data. By extrapolating growth trends, I estimated the future volume of scRNA-seq data using both conservative and optimistic models. I also applied NLP scaling laws to estimate a 'compute optimal' model size for single-cell models given the available data volume.
Today, my goal is to quantify the publicly available, raw genomic sequence information, rates of growth, and important considerations in collecting and modeling this data (sequencing costs, genomic diversity, junk DNA, etc). I hope my commentary helps anchor expectations for biological data in the age of AI: where there is value now, and where it could emerge in the future.
Sequencing data
The Sequence Read Archive (SRA) serves as a global repository for raw sequencing reads contributed by researchers worldwide. It stores unprocessed short DNA or RNA sequences produced by high-throughput sequencing technologies, which rapidly sequence large genomes by breaking them into smaller fragments and sequencing them in parallel.
Raw sequencing data contains considerable redundancy, as overlapping reads capture the same information multiple times. This redundancy is essential for genome assembly, where computational methods align and merge these overlapping reads to reconstruct the original genome sequence. But it also means that simply counting raw sequencing reads can significantly overestimate the amount of unique information.
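To make the redundancy concrete, here's a deliberately naive Python sketch that greedily merges overlapping reads into a single contig. The toy reads, overlap threshold, and greedy strategy are purely illustrative assumptions; real assemblers operate on overlap or de Bruijn graphs at vastly larger scale.

```python
# Naive illustration of read redundancy: three overlapping reads collapse
# into one shorter contig once their overlaps are merged.

def merge_pair(a, b, min_overlap=3):
    """Append b to a if a suffix of a matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None

reads = ["ATGCGTAC", "CGTACGGA", "ACGGATTC"]  # toy overlapping fragments
contig = reads[0]
for read in reads[1:]:
    merged = merge_pair(contig, read)
    contig = merged if merged is not None else contig + read

raw_bases = sum(len(r) for r in reads)
print(f"raw bases: {raw_bases}, assembled contig: {len(contig)} bases")
# raw bases: 24, assembled contig: 14 bases (~1.7x redundancy in this toy case)
```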
To better understand the magnitude of this redundancy, we can look at a recent preprint by Chikhi et al., Logan: Planetary-Scale Genome Assembly Surveys Life’s Diversity. Using cloud computing, they performed genome assembly across nearly all of the 27 million datasets in the SRA. After assembly, the “50 petabases of SRA raw data become 384 terabytes of compressed contigs, a 130x reduction in size.” The comparison mixes units, base pairs (50 petabases = 5e16 base pairs) versus storage size (384 terabytes = 3.84e14 bytes), but if we assume 1 base pair ≈ 1 byte, the final compression works out to roughly 130x.
This level of compression highlights the vast redundancy in raw sequencing data. Essentially, for every 130 nucleotides in the original raw data, the deduplicated data stores the equivalent of just one nucleotide. Keep this in mind as you interpret the following graphs, which do not account for this redundancy.
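As a quick sanity check, the ratio is a one-line division once the one-byte-per-base-pair assumption is made explicit:

```python
# Back-of-the-envelope check of the Logan compression ratio, assuming roughly
# one byte per base pair in the compressed contigs (as discussed above).
raw_basepairs = 50e15            # 50 petabases of raw SRA data
contig_bytes = 384e12            # 384 terabytes of compressed contigs
contig_basepairs = contig_bytes  # assumption: ~1 byte per base pair

reduction = raw_basepairs / contig_basepairs
print(f"approximate reduction: {reduction:.0f}x")  # ~130x
```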
Composition
So, how many raw genomic sequencing reads are in the SRA? By my estimate, approximately 7e10 megabases (Mb), or 7e16 base pairs (bp) (Figure 1). This aligns with the Logan estimate, with the time difference likely accounting for the additional data in my calculation (Logan's snapshot is from December 2023).
Whole genome sequencing (WGS) constitutes a large part of the dataset (Figure 2). However, WGS includes many nucleotides of limited significance. While the non-coding genome has essential regulatory elements like enhancers and silencers, much of the 'dark genome'—including repetitive sequences and transposable elements—lacks functional information. In contrast, whole exome sequencing (WXS) targets only protein-coding regions, reducing non-informative data but missing critical regulatory sequences in non-coding areas.
While WGS captures the full spectrum of genetic variation, it's a broad approach that generates a lot of extra data, much of which may be unnecessary. Identifying every functional genomic element remains a major challenge, and continued progress in this area will make sequence-based machine learning models in genomics far more efficient and effective. Currently, using WGS data in models has significant drawbacks: (1) it's expensive to collect, (2) training on very long sequences remains an unsolved problem, and (3) models must learn to filter out large amounts of irrelevant data.
Generation rate
Next, let's examine the rate of genomic data generation. I applied both linear and exponential models to the data starting from 2015, and the trend appears to be roughly linear (Figure 3). With 8 billion people on Earth and each human genome containing 3.2 billion nucleotides, we've sequenced the equivalent of about 21.9 million humans, or just 0.27% of the global population. To sequence the equivalent of the entire human population (8 billion people, which translates to 2.56e19 bp or 2.56e13 Mb) would take an estimated 3,470 years at the current rate.
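Here is that arithmetic spelled out. The annual rate is an assumption read off the linear trend in Figure 3, so treat the final figure as order-of-magnitude only:

```python
# Rough reproduction of the population-equivalent estimate. The SRA total is
# the estimate from Figure 1; the annual rate is an assumed value implied by
# the linear fit in Figure 3.
sra_total_bp = 7e16        # estimated raw genomic sequence in the SRA
bp_per_human = 3.2e9       # bases per human genome
population = 8e9

human_equivalents = sra_total_bp / bp_per_human
fraction_of_pop = human_equivalents / population
print(f"human-genome equivalents: {human_equivalents:.2e} ({fraction_of_pop:.2%})")
# ~2.19e7 equivalents, ~0.27% of the population

target_bp = population * bp_per_human      # 2.56e19 bp
annual_rate_bp = 7.4e15                    # assumed bp/year from the linear trend
years_remaining = (target_bp - sra_total_bp) / annual_rate_bp
print(f"years to reach population-scale coverage at this rate: {years_remaining:.0f}")
# on the order of 3,500 years
```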
Now, I don’t expect that we need 8 billion comprehensively sequenced human genomes to make really impactful progress in the life sciences with AI tools. The question of how much data we actually need is highly relevant. However, we know so little about biology that it’s difficult to benchmark rates of progress. Our evaluations are not only very weak, but critically, very slow.
In my view, the slow pace of evaluation cycles in life sciences is a critical but often overlooked issue. Consider the contrast: while life science experiments can take weeks or months to produce results, evaluating the accuracy and utility of a generative text model takes only milliseconds for a human user.
Of course, the SRA includes data from a wide range of organisms. In the near term, evolutionary diversity is crucial for developing ML systems that can generalize across a wide range of tasks. Eventually, deep, organism- and modality-specific datasets will be needed to enhance predictive performance for particular applications. For now, broader genomic diversity is a rising tide that lifts all boats.
NLP anchors
As in my previous post, let's return to the NLP analogy. We can't draw firm conclusions given the distinct differences between these domains, but the anchor may help frame the sheer volume of genomics data and its growth.
Consider Common Crawl—a web-scale corpus. It’s somewhat analogous to raw sequencing data: we have broad statistics on web page content, but only a fraction is truly useful information (see the stats page). Common Crawl is estimated to contain trillions of text tokens (let’s say 5e12).
If we do a rough calculation: 5e12 tokens / 8e9 people = 625 tokens per person. This is a very crude estimate, as the distribution of tokens per person is complex. But compare this to the information content from biology: 3.2e9 base pairs per person.
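In code, with the Common Crawl token count treated as a loose assumption (published estimates vary):

```python
# Crude per-person comparison of web text tokens vs. genomic bases.
common_crawl_tokens = 5e12   # assumed Common Crawl token count
population = 8e9
bp_per_human = 3.2e9

tokens_per_person = common_crawl_tokens / population
print(f"text tokens per person: {tokens_per_person:.0f}")    # 625
print(f"base pairs per person:  {bp_per_human:.1e}")         # 3.2e+09
print(f"ratio: {bp_per_human / tokens_per_person:.1e}")      # ~5.1e+06x more bases than tokens
```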
My point is that we’ve made significant strides with machine learning in a small, rate-limited information domain. Meanwhile, an immense amount of biological information is out there, waiting to be measured.
We need way cheaper ways to measure biological information. I had a rosy outlook about this until I started the following analysis. What originally colored my optimism was this familiar graph:
The typical interpretation is “holy crap, measuring DNA is getting really cheap, really fast.” Looking more closely, though, a better interpretation is “holy crap, measuring DNA got much cheaper, really fast, between 2007 and 2014” (thanks to second-generation sequencing technology). Now look at the rate from 2015 to 2022: in relative terms it’s back on pace with Moore’s Law, if not lagging a bit.
To potentially 'transform healthcare with AI,' we need a massive reduction in measurement costs. At the current pace, we might reach the 1-cent genome by 2054—30 years from now. That timeline might be acceptable for level-headed experts, but I doubt it aligns with public expectations or the level of investment currently flowing into the industry.
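For transparency, here is the kind of extrapolation behind that date. The starting cost, starting year, and halving time are all rough assumptions, so read the exact year loosely:

```python
import math

# Sketch of the 1-cent-genome extrapolation. The starting cost and year are
# assumptions (roughly the NHGRI cost-per-genome figure circa 2022), and the
# halving time reflects the Moore's-Law-like pace described above.
start_year = 2022
start_cost = 500.0       # assumed USD per genome around 2022
target_cost = 0.01       # the "1-cent genome"
halving_time = 2.0       # assumed years per 2x cost reduction

halvings_needed = math.log2(start_cost / target_cost)
year_reached = start_year + halvings_needed * halving_time
print(f"halvings needed: {halvings_needed:.1f}, reached around {year_reached:.0f}")
# ~15.6 halvings -> lands in the early 2050s, in the same ballpark as the ~2054 figure above
```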
The most impactful advances in life sciences may still come from innovations in how we measure biology. Unlocking answers to the most complex biological questions with AI depends on a scale of data that we may not have for some time, barring technology leaps.
My key takeaways
The amount of available data is insufficient considering the functional complexity of biology. Sequencing alone isn't enough; we also need comprehensive functional annotations.
Genomic data collection is prohibitively expensive, and we're far from a scalable solution—this could constrain the near- and medium-term impact of AI on life sciences, especially considering public expectations.
There are significant opportunities to improve how we measure biological systems, both by advancing technology and by identifying what to measure.
Even after realigning public expectations, there is still room for significant progress.