The state of research on virtual cell modeling
Measuring where we are on the emerging grand challenge
The measure of a method is its use.1 How do modern virtual cell models score on this rubric? Counting papers, citations, and GitHub stars, there is a flurry of activity in the space. However, I believe most users who adopt these tools with a specific task in mind reach a similar conclusion: published methods are still squarely a line of research, not yet ready for application.
Structure of the problem
To set the stage, what exactly do I mean by “virtual cell”? Simulating a cell's control logic has emerged as a compelling "grand challenge" for the applied bioML community, following breakthroughs on the protein folding problem. This challenge now appears tractable, given the accelerating pace of cell-wide (omics) data generation and progress in learning methods.
Virtual cell modeling (VCM) has predominantly focused on omics data, particularly single-cell RNA sequencing (scRNA-seq), due to its remarkable scalability. However, as Anshul Kundaje recently reminded the community, equating VCM progress solely with scRNA-seq gene expression prediction presents a misleading metric for the virtual cell challenge.
More specifically, the problem possesses greater structural depth: scRNA-seq (or more broadly, "omics") prediction primarily addresses trans-gene regulatory modeling. The equally crucial counterpart is cis-gene regulatory modeling. "Cis" (roughly, "near") refers to how proximal genomic elements, such as promoters and enhancers, influence outcomes like gene expression or chromatin structure. "Trans" (roughly, "distant") describes how factors encoded elsewhere in the genome, such as transcription factors, affect these same outcomes.
Anshul argues that our progress on the virtual cell challenge is incompletely defined if we only measure success in trans-gene regulatory prediction. His group has made significant strides in cis-gene regulatory modeling, and just last week, AlphaGenome achieved another major breakthrough in this area. By all accounts, solutions to the cis-modeling problem are more advanced than those for the trans-modeling problem.
Where things stand
So where are we on trans-VCM? Earlier I made the claim that current methods fall short by measures of utility. It’s really hard to gauge how well these models work just by reading the literature—there is a huge diversity of metrics in use, commonly with overstated claims satisfying a “methodological breakthrough” narrative.
My oral history of the field: first a phase of many transformer pretraining methods with scant benchmarking, then a reckoning of these approaches against simple baselines, and now a rush to propose and refine a litany of different metrics.
I want to talk about metrics, and specifically, how they reflect a method’s utility.
As I was drafting this, a new paper was shared that I think is a very important milestone for moving the field out of a mucky slowdown. This paper fits really well with the sentiments of this post: it rigorously evaluates why the typically used metrics fail, and how to account for this. I will discuss these results in more detail after a qualitative description of what defines "utility" for an inference method.
Prediction utility
This idea of prediction utility is fuzzy but I think most folks will agree with my intuitions. In general, it captures, “measuring a method by its use.” I’ll lay out a few axes that I think are important for prediction utility, and then compare this to what’s the most successful case of applied ML in bio: protein structure prediction.
Strong absolute magnitude of performance. This is raw, quantitative performance. Not relative differences from other methods, or metrics that obfuscate shortcomings.
Confidence and error bounding. This should quantify where the model has high/low certainty about a prediction, in turn informing downstream use.
Direct link to an intervention. The prediction should suggest what to do next, a direct action to take.
AlphaFold’s solution to structure prediction checked the first two boxes. Community-driven benchmarks played a big role here—and no doubt took time to mature. I think we’re seeing something similar begin to happen in trans-VCM, and over time the bar will rise in service of the bigger vision.
It was also really important for AlphaFold to provide users with confidence measures. For example, pLDDT was an important new metric; such confidence scores have high utility in guiding user decision-making. These are missing from trans-VCM, but I can imagine confidence measures reported on a per-gene, per-pathway, and per-profile basis would increase prediction utility.
Finally, the third criterion is emergent from the feature set itself. And I think this is where the challenge of trans-VCM really differs from protein structure prediction. A 3D protein structure is directly actionable in today's drug discovery pipeline. Structure-based drug design has a long history and a relatively mature playbook, and structure prediction fits right into it.
Don’t get me wrong, trans-gene expression prediction has plenty of high-utility applications. But systems biology doesn’t come with the same kind of canonical downstream steps. I think this points to a larger design space in what is a new approach to drug discovery.
Related to this is the nature of the data itself. Omics profiles often have tens of thousands of features per sample, an overwhelming number of variables that are nearly impossible for humans to reason about directly. That's a stark contrast with the compact, spatially grounded outputs of structure models. Not only is the number of variables much smaller, but the 3D output is extremely well-suited for human reasoning. We can look at a pocket and imagine what type of chemical molecule might fit into it. Of course it's not that easy, but the prediction output has high utility with regard to direct interventions, in part because human cognitive biases make this a more interpretable feature space.
Current state of research
Over the last year, there has been a strong focus by the community on critically looking at the data we have available to work with. Core questions are being asked about the replicability of experimental methods, and what our expectations should be when we ask a model to learn on data with high sampling variance. Typically, these findings are couched in the presentation of a new method. But I think the gold nuggets are in the experiments around data quality and the metrics we are using to understand performance.
Originally I planned to share a few recent papers and then describe the metrics, rather than the methods, they focus on refining. I think this has been the common undercurrent of research. However, as I was drafting this, Bo Wang and the Shift Bioscience team shared what I believe is a highly-impactful paper investigating the pitfalls of current trans-VCM metrics. They offer proposals for (1) new evaluation metrics that explicitly account for biased baselines, and (2) weighting schemes designed to guide models toward learning the most salient signals—while also providing a more meaningful measure of model performance.
I want to highlight a few of these core results. As an aside, the recent papers I took extensive notes on, and which are worth reading (particularly for mapping the landscape of metrics and benchmarks), are: TRADE (Harvard/MIT), TxPert (Recursion), PRESAGE (Genentech), and State (Arc).
The paper Diversity by Design: Addressing Mode Collapse Improves scRNA-seq Perturbation Modeling on Well-Calibrated Metrics explains why simple mean-baselines often match or outperform complex deep learning models in scRNA-seq perturbation prediction. This anomaly is attributed to metric artifacts: specifically, control-references and unweighted error metrics that inadvertently reward mode collapse when the control is biased or the biological signal is sparse.
Two core phenomena contribute to this issue:
Systematic control bias: Using non-targeting control (NTC) cells as the reference for delta perturbation profiles can lead to a biased control population (a “delta” profile is commonly used to represent the difference of a perturbed profile from a reference: ∆up = up – uNTC). This bias artificially inflates the performance of mean baselines on metrics like Pearson(∆) because the general effect of any perturbation (the mean of all perturbations, ∆uALL = uALL – uNTC) can dominate the unique effect of a specific one, making each perturbed profile appear highly correlated with the mean of all perturbations (Fig. 1a). Pearson correlation is also limited because it captures only the linear association between a prediction and the ground truth, not the scale of their differences.
Signal dilution: Genetic perturbations typically induce changes in a small subset of genes, leading to sparse true biological differences. MSE treats all features equally, causing signal dilution and inadvertently rewarding models that predict the dataset mean (mode collapse) rather than capturing crucial, low-dimensional biological changes. In other words, most gene expression patterns do not change as a result of a genetic perturbation. Models are rewarded for just predicting the typical “mode” of each feature, as this is a good prediction for the vast majority of genes (maybe >99% of 20k features). However, the “high utility” features are a small set of DEGs.
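Both phenomena are easy to reproduce in a toy simulation. The sketch below (pure Python, all numbers invented for illustration) builds a few perturbation profiles that share a generic response plus a handful of perturbation-specific DEGs, then scores the "mean baseline" that simply predicts the average of all perturbed profiles. Despite missing every DEG, the baseline gets a high Pearson(∆) against the NTC reference and a low MSE:

```python
import random
import statistics

random.seed(0)
G, P = 500, 8                                           # genes, perturbations
ntc = [random.gauss(0.0, 1.0) for _ in range(G)]        # control profile
shared = [random.gauss(1.0, 1.0) for _ in range(G)]     # generic response to any perturbation

profiles = []
for _ in range(P):
    prof = [ntc[g] + shared[g] for g in range(G)]
    for g in random.sample(range(G), 5):                # 5 true DEGs per perturbation
        prof[g] += 3.0
    profiles.append(prof)

def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

mean_prof = [statistics.fmean(p[g] for p in profiles) for g in range(G)]
mean_delta = [mean_prof[g] - ntc[g] for g in range(G)]  # the baseline's "delta"

# Score the mean baseline against every individual perturbation.
pearsons = [pearson([p[g] - ntc[g] for g in range(G)], mean_delta) for p in profiles]
mses = [statistics.fmean((p[g] - mean_prof[g]) ** 2 for g in range(G)) for p in profiles]

print(f"Pearson(delta) of mean baseline: {statistics.fmean(pearsons):.2f}")  # high
print(f"MSE of mean baseline:            {statistics.fmean(mses):.2f}")      # low
```

The shared component dominates both deltas (control bias), and the 5 DEGs are drowned out by 495 unchanged genes (signal dilution), so the baseline looks excellent on both metrics while capturing nothing perturbation-specific.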
To address these limitations, the authors introduce usage of the perturbation-mean profile (uALL) as the reference control, and DEG–aware metrics:
Weighted mean-squared error (WMSE) and weighted delta R2 (R2w(∆)). These metrics are sensitive to “niche” perturbation-specific signals (i.e., small sets of DEGs) and measure error with respect to the mean of all perturbed cells in the dataset (uALL), rather than non-targeting controls. This crucial change eliminates control bias and “ensures the prioritized genes are the ones that make that perturbation unique from all the others”.
WMSE: weights the per-gene squared errors by each gene’s t-score (derived from a standard significance-testing framework), using the mean of all perturbed profiles as the reference. This focuses the metric / loss on DEG prediction errors.
R2w(∆): computes R2 between the difference of a perturbation and the mean of all perturbations (up – uALL), with the option to weight features (i.e., to focus on DEGs). Most importantly, R2 is much more stringent than Pearson, as it penalizes both bias and scaling errors, not just correlation. While Pearson reflects how well predictions track the trend of the data, R2 measures how close they are in absolute terms, making it better suited to the grand challenge of modeling a virtual cell.
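The mechanics of these two metrics can be sketched compactly. This is my reading of the descriptions above, not the paper’s reference implementation: the weights `w` are assumed to be nonnegative DEG scores (e.g., |t|-statistics), and `ref` is the mean of all perturbed profiles:

```python
def wmse(y_true, y_pred, w):
    """Weighted MSE: per-gene squared errors scaled by DEG weights
    (e.g. |t|-scores computed against the mean of all perturbed profiles)."""
    num = sum(wi * (t - p) ** 2 for wi, t, p in zip(w, y_true, y_pred))
    return num / sum(w)

def r2w_delta(y_true, y_pred, ref, w=None):
    """Weighted R^2 on delta profiles, with deltas taken against `ref`
    (the mean of all perturbed profiles). Unlike Pearson, this penalizes
    bias and scale errors, not just departures from linear association."""
    if w is None:
        w = [1.0] * len(y_true)
    dt = [t - r for t, r in zip(y_true, ref)]   # true delta
    dp = [p - r for p, r in zip(y_pred, ref)]   # predicted delta
    wsum = sum(w)
    dbar = sum(wi * d for wi, d in zip(w, dt)) / wsum
    ss_res = sum(wi * (a - b) ** 2 for wi, a, b in zip(w, dt, dp))
    ss_tot = sum(wi * (a - dbar) ** 2 for wi, a in zip(w, dt))
    return 1.0 - ss_res / ss_tot
```

One property falls out immediately: the mean baseline (predicting `ref` itself) makes every predicted delta zero, so its R2w(∆) lands at or below zero, consistent with the paper’s finding that the mean baseline falls to null performance under calibrated metrics.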
They also introduce negative and positive (technical duplicate) performance baselines to properly calibrate these metrics, allowing a more transparent assessment of model performance. This reveals that the mean baseline indeed falls to null performance under these calibrated metrics, while a technical duplicate provides an upper bound on expected prediction performance.
Crucially, the paper demonstrates that WMSE can directly replace MSE as a training loss function. When a model like GEARS was retrained with WMSE loss, it showed substantial improvement, recovering more of the true perturbation variance instead of shrinking towards the dataset mean, thus reducing mode collapse. This improvement was observed even in zero-shot prediction tasks, providing early evidence that DEG-based weighting provides a generalization advantage by steering optimization towards sparse, high-variance predictions that better reflect perturbation effects we care about (utility!).
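To see why the choice of loss changes what gets learned, consider a deliberately capacity-limited model that must predict one shared value for every gene. The loss-minimizing constant under MSE is the unweighted mean delta, diluted toward zero by the hundreds of unchanged genes, while under WMSE it is the weight-weighted mean, pulled toward the DEGs. A toy sketch (the weights here are hypothetical DEG scores, not the paper’s exact scheme):

```python
# Toy: 100 genes, one true DEG (delta = 3.0), the rest unchanged.
delta = [3.0] + [0.0] * 99
w_deg = [50.0] + [1.0] * 99   # hypothetical DEG-score weights

# A model forced to predict one shared value s for every gene minimizes
# MSE at the unweighted mean and WMSE at the weighted mean -- so the
# weights decide which genes the optimization "attends" to.
s_mse = sum(delta) / len(delta)
s_wmse = sum(wi * d for wi, d in zip(w_deg, delta)) / sum(w_deg)

print(f"MSE-optimal shared prediction:  {s_mse:.3f}")   # ~0.030
print(f"WMSE-optimal shared prediction: {s_wmse:.3f}")  # ~1.007
```

Real models aren’t constant predictors, of course, but the same pressure applies to their gradients: DEG weighting redirects optimization toward the sparse, high-variance signal instead of the dataset mean.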
In conclusion, the paper proposes a four-step remedy: (i) use the mean of all perturbed cells to remove systematic control bias in delta and DEG calculations; (ii) adopt DEG-score weighted metrics (R2w(∆), WMSE); (iii) calibrate all metrics with negative, null, and positive baselines; and (iv) implement DEG-aware optimization objectives like WMSE.
The mode-collapse phenomenon has been apparent for a while, and this study does a great job investigating exactly why it occurs. Reassuringly, the reasons are very interpretable, and actionable. The conclusions toward focusing metrics on signals that have downstream impact (DEGs) align with my description of prediction utility.
My (optimistic) takeaway is that current methods aren't necessarily bad, but our measuring sticks have been. Some simple changes may greatly improve the many available methods, particularly when compared against well-calibrated baselines.
This paper was fortunate timing for my write-up, as I think it plants a new flag that will help the community move on from a slowdown. I expect more metrics will be explored based on these findings, and the community will speed up to coalesce around a few interpretable metrics that allow us to evaluate method utility in real time. I’m also looking forward to work on measuring model prediction confidence that is relevant to the particularities of how we use high-dimensional omics data. Finally, I’m curious to see previous methods evaluated under well-calibrated metrics and, of course, new methods to come.
Thanks to Clayton Mellina for the helpful comments and review.
Credit to my advisor, Michael Keiser, for this consistent advice.




