Unknown Unknowns in MAVE Data

2026-04-15 bioinformatics

Thoughts from this year's Mutational Scanning Symposium

But there are also unknown unknowns – the ones we don’t know we don’t know. – Donald Rumsfeld¹

I’ve just finished up at this year’s Mutational Scanning Symposium, which was three days of incredibly dense information. I also got to meet a lot of amazing, friendly people, including the MAVEdb team, who I’ve been working with for years but hadn’t met in person until now.

It left me with a lot to think about², too much to write up all at once, and for some of the specifics I’ll have to wait for the videos to come up on the Atlas of Variant Effects YouTube channel. But I’ve got to start somewhere, and one thing came up as a recurring theme …

The VUS Crisis

This year seemed to be quite clinically oriented, and several speakers mentioned the “VUS Crisis” … basically, we’ve discovered a lot of genetic variants, and while we’ve determined that some of them are pathogenic and some benign, that leaves a hell of a lot of variants in neither category, aka “variants of uncertain significance” (VUS).

Significance

MAVEs don’t provide enough evidence on their own, but there are Standards and Guidelines on how they can assist in classifying variants. Depending on what the assay measures, a high score can provide supporting evidence of pathogenicity and a low score of benignity, or vice versa. In general this is done by picking thresholds beyond which the score is regarded as significant.

ExCALIBR³ provides a methodology for setting these thresholds, and its introduction is a good primer on the concept of supporting evidence. ExCALIBR’s output provides both an estimate of pathogenicity and a strength of evidence based on Bayesian statistics.

In search of mediocrity

I’m particularly interested in hypomorphic variants: variants which reduce a gene’s effectiveness but don’t completely remove it. Hypomorphic variants can be tricky to spot experimentally, because negative feedback within the cell can mean that a gene producing a protein which is half as effective just gets expressed more to compensate.

But this isn’t true for all genes, and the inefficient expression may still have deleterious effects on the whole organism even if individual cells do okay, so these variants are still interesting clinically.

Our current scoring and classification system doesn’t really distinguish between “we don’t have enough evidence to say if this variant is benign or pathogenic” and “we know this variant is hypomorphic”, and I think this is worth further investigation.

The first question I’m working on: is there evidence of these variants in existing experimental data, such as that found in MAVEdb?

Signal vs. Noise

Experiments generally have multiple replicates, and where we have scores for each replicate we can look at how those scores are distributed. Variants with widely distributed scores might end up with a middling average score, but we can tell from the standard deviation that the average is meaningless.

If we have $ N $ scores $ x_i $ , we can work out the sample mean $ \bar{x} $ and sample variance $ s^2 $ :

$$ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i $$

$$ s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2 $$
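
As a minimal sketch of those two formulas, Python’s standard `statistics` module already does the job: `variance()` uses the same $ N - 1 $ denominator as $ s^2 $ above. The helper name `replicate_stats` and the example scores are made up for illustration, not taken from any real dataset.

```python
from statistics import mean, variance


def replicate_stats(scores):
    """Return (sample mean, sample variance) for one variant's replicate scores.

    `replicate_stats` is a hypothetical helper name, not part of any library.
    """
    if len(scores) < 2:
        # The sample variance divides by N - 1, so it needs at least two scores.
        raise ValueError("need at least two replicate scores")
    return mean(scores), variance(scores)


# Made-up replicate scores for a single variant.
x_bar, s2 = replicate_stats([0.4, 0.5, 0.6])
```

The sample standard deviation $ s $ is just the square root of `s2` (or `statistics.stdev()` directly).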

What we’re looking for is variants where $ \epsilon < \bar{x} - s $ and $ \bar{x} + s < 1 - \epsilon $ .

$ \epsilon $ is a bit of a fudge here, just some arbitrary limit. It’d probably be better to use the standard deviation of nonsense variants as the low limit, and the standard deviation of synonymous variants as the high limit. Really I’m just looking for anything whose distribution is centered and not too broad.
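
Here’s a rough sketch of that filter, not a finished method. It assumes scores have been normalised so that nonsense-like variants sit near 0 and synonymous-like variants near 1 (which is what the $ 1 - \epsilon $ bound implies), and it uses a single arbitrary `epsilon` for both limits. The function name, the variant names and the replicate scores are all hypothetical.

```python
from statistics import mean, stdev


def is_intermediate(scores, epsilon=0.1):
    """True if mean +/- one sample std dev lies strictly inside (epsilon, 1 - epsilon).

    Assumes scores are normalised so ~0 is nonsense-like and ~1 is synonymous-like.
    `epsilon` is an arbitrary fudge factor, as in the text.
    """
    if len(scores) < 2:
        # Can't estimate a spread from a single replicate, so don't flag it.
        return False
    x_bar = mean(scores)
    s = stdev(scores)  # sample standard deviation (N - 1 denominator)
    return epsilon < x_bar - s and x_bar + s < 1 - epsilon


# Made-up replicate scores, purely for illustration.
replicates = {
    "variant_a": [0.45, 0.50, 0.55],  # centered and tight
    "variant_b": [0.05, 0.50, 0.95],  # centered but far too broad
    "variant_c": [0.95, 1.00, 1.05],  # tight but looks synonymous-like
}

candidates = [name for name, scores in replicates.items() if is_intermediate(scores)]
```

With these invented numbers only `variant_a` survives: `variant_b` has a middling mean but its spread reaches past the lower limit, and `variant_c` is indistinguishable from full function. Swapping `epsilon` for limits derived from the nonsense and synonymous distributions would mean giving the function separate low and high bounds.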

TO BE CONTINUED

¹ Better known for other work.
² Unfortunately, also COVID.
³ Not to be mistaken for ExCalibR or ExcalibR or ExCalibr or EXCALIBR … researchers, do not name your new calibration tool after the famous sword.