CountESS

Count-based Experiment Scoring and Statistics

Nick Moore
nick@zoic.org
@nickzoic@aus.social

Press "." to enter slideshow, once entered "PgUp" and "PgDn" to navigate, "Esc" to leave slideshow again.

Funding

This work is supported by WEHI (Walter and Eliza Hall Institute of Medical Research)

Thanks also to collaborators from the Brotman Baty Institute
and
University of Washington Genome Sciences.

Content Warnings

This presentation briefly mentions:

... but doesn't go into gross medical details.

Bioinformatics in 9½ minutes

First up, Bioinformatics in 9½ minutes

Bioinformatics in 9½ minutes

[*] Approximately

A lot of this will be pretty approximate too so don't sweat the details right now. If you're interested in the details let's catch up after this session over lunch.

Or for those of you watching online, you could do a lot worse than going down a Bioinformatics Wikipedia Rabbit Hole

Bioinformatics

There's some really interesting history here which would make a great historical research presentation in its own right.

References:

Cell Biology

Image: LadyofHats, Public domain, via Wikimedia Commons

There's a lot we still don't know about cell biology. The mechanisms by which all living things function and reproduce are complex and difficult to observe directly, since they are hidden inside living cells.

But we can construct experiments which let us observe the outward behaviour of cells and through analysis we can determine what's going on inside them. This is not totally unlike black-box testing of software: you might not be able to read the code, but if you can find a correlation between a change in input and a change in output, you can start to understand what the program does even if you can't directly watch it doing it.

The Human Genome

  • The "program" of your cells
  • 23 Chromosomes, each is a very long molecule
  • Each is 100M or so building blocks called "base pairs" or nucleotides
  • All of your cells have the same genome

Image: Zephyris, CC BY-SA 3.0, via Wikimedia Commons

The "program" of a cell is called its genome. The human genome is split into 23 chromosomes each of which is a DNA molecule, a chain of a hundred million or so little building blocks called base pairs or nucleotides. All in all, the human genome consists of about 12Gbit of information.

All of your cells have the same genome, like sharing a container image, but different genes are active in different cells, depending on their role in the body.

Nanopore Electrophoresis

Image: DataBase Center for Life Science (DBCLS), CC BY 4.0, via Wikimedia Commons

These days, we can actually read out an individual cell's entire genome using a technique called "nanopore electrophoresis", which draws the DNA molecule through a tiny hole like a small child slurping up spaghetti. And using techniques we've learned from studying bacteria and viruses, we can even edit DNA molecules and change or append base pairs. So we can read and write the binary, but that doesn't mean we understand the program.

CRISPR

Genes


Image: Thomas Shafee, CC BY 4.0, via Wikimedia Commons

We do know a bit about the syntax. Each chromosome contains many regions called genes, with each gene providing the instructions to make tiny blobby machines called proteins, which the cell uses to do its job. Each gene is wrapped up in more instructions which form a kind of header and footer and which regulate how and when the gene is used. Some genes are small, hundreds of base pairs long and producing a single protein and some are very large, a couple of million base pairs and encoding dozens of proteins.

Gene Names

source: NBCI


There are no debug symbols in the genome, so when we think we know what a gene does, we give it an arbitrary name. This process is familiar to anyone who's reverse engineered a binary, either with paper and pencil or with modern reverse-engineering tools.

We know of about 20,000 genes which encode proteins, but those constitute about 3% of the entire genome. We're pretty sure at least 80% of the genome does something, so we've got a ways to go.

Reference: "Encode Project Writes Eulogy For Junk DNA", Science Magazine.

Gene Variants

Single Nucleotide Polymorphism (SNP)

Single Nucleotide Variants (SNV)

Every person's genome is different. We have the concept of a "reference genome" which describes what we think the "normal" sequence for every gene is, but no-one's actually got that exact genome.

The most common kind of variant is a single base (nucleotide) change. If it's a common kind of difference, seen in more that 1% of the population, it's called a Polymorphism. Whereas if it's a rare change it's generally called a Variant. It means the same thing: one base is changed from the reference.

Gene Variants

Easily observable:

Some differences in genes lead to obvious, observable differences in the organism, such as ability to digest lactose as an adult, or hair colour. These tend to have a large variation in the population because they don't matter very much, or at least not in our current lives.

Because there's a big population to work with, it's possible to find a correlation between lactose intolerance and specific variants of genes LCT and MCM6.

References:

Gene Variants

... or to determine the likely hair colour of an ancient human from the DNA in her 5700 year old chewing gum.

Reference: Human Genome Recovered From 5,700-Year-Old Chewing Gum

Long COVID?

Source: MedRxiv / Lammi et. al. "Genome-wide Association Study of Long COVID", 2023-06-29

As another example, recent work published on MedRXiv suggests a link between specific variants of FOXP4 and severity of COVID infection as well as susceptibility to Long COVID.

See also: NPR Goats and Soda

Cancer


Other gene variations lead to more subtle differences, such as your likelihood of developing specific cancers. Cancers are caused by damage to DNA, perhaps due to chemicals or radiation or just bad luck. Most such damage triggers cellular defense mechanisms, which repair the DNA, or destroy the cell rather than allow it to become a tumor.

But we know clinically that some gene variants reduce the effectiveness of these defense mechanisms, greatly increasing the risk of developing specific cancers.

Because these pathogenic variants are rare, and people's lives are complicated, it's much harder to find them in among the background noise of benign variants and unrelated risk factors.

To put it in programming terms, we're looking for a bug in an exception handler in some critical legacy code which is live and exposed to the Internet.

See also: Rosen et al, "BRCA1 gene in breast cancer", 2003

Multiplex Assays of Variant Effect (MAVEs)

Image: A Rodent of Unusual Size, from The Princess Bride

Multiplex Assays of Variant Effect ... wait, wrong slide ...

Multiplex Assays of Variant Effect (MAVEs)

Clinical Approach

Clinical Cases
Gene Sequencing
Variant Classification

Multiplex Assays of Variant Effect ... right, so there's a few thousand known variants of these genes we've learned about from clinical cases.

But there's a lot of possible variants we don't know anything about, because we've never observed them clinically. Some of those are very likely harmless, and others might be very serious: this information can inform individual choices for the prevention or treatment of cancer.

Multiplex Assays of Variant Effect (MAVEs)

Experimental Approach

Clinical Cases Gene editing
Gene Sequencing Experimental Assay
Variant Classification

Multiplex Assays of Variant Effect (MAVEs) are an attempt to catalog not just the gene variants we've seen clinically but *all possible variants* of a gene, classifying their effects as more or less concerning, by contructing experiments which measure the variant effect.

The outcome is the same: variants classified by harmfulness, and the two approaches complement each other.

MaveDB


Image: MaveDB

As part of Friday's Django miniconf, my colleague Alan Rubin spoke about MaveDB, an online respository for results from MAVE experiments so if you missed that one check it on the conference videos.

I'm going to briefly introduce a couple of MAVE experimental techniques: Saturation Genome Editing and VAMP-seq, because they illustrate the problem we're trying to solve.

Saturation Genome Editing

Source: ncbi.nlm.nih.gov

This paper uses Saturation Genome Editing to analyze almost 4000 single-base variants of BRCA1. For each variant, the effect on cell growth is assessed by counting the variants in the cell population after 11 days. The results show good correlation with the variants we already know of clinically and also identifies new pathogenic variants. Similar work is underway to analyse other tumor suppressing genes such as BRCA2 and PALB2.

Saturation Genome Editing

https://sge.gs.washington.edu/BRCA1/

Source: UW / BBI

This chart shows the distribution of scores for different single nucleotide variants of BRCA1. The largest peak is around a score of 0, indicating that a lot of variants are pretty much harmless, but the other peak shows the variants which cause "loss of function", and thus increased risk of cancer developing.

VAMP-seq — 1

Reference: "Multiplex Assessment of Protein Variant Abundance by Massively Parallel Sequencing"

The VAMP-seq technique produces cells of each possible gene variant, like SGE, but also adds the sequence to create a fluorescent protein called EGFP, derived from a jellyfish protein.

This protein is useful for all kinds of experiments and is kind of the equivalent of an embedded software developer adding in a couple of diagnostic LEDs.

VAMP-seq — 2

Reference: "Multiplex Assessment of Protein Variant Abundance by Massively Parallel Sequencing"

The more the gene is expressed the more EGFP is produced, and the more the cell fluoresces. Cells are then "binned", mechanically sorted by how much they glow.

VAMP-seq — 3

Reference: "Multiplex Assessment of Protein Variant Abundance by Massively Parallel Sequencing"

Each bin is then sequenced, and the population of each variant in each bin is counted.

VAMP-seq — 4

Reference: "Multiplex Assessment of Protein Variant Abundance by Massively Parallel Sequencing"

This is then used to "score" each variant on how much protein is present.

VAMP-seq — 5

Reference: "Multiplex Assessment of Protein Variant Abundance by Massively Parallel Sequencing"

... and a heat map can be made showing the effect on the change of each amino acid at each position in the protein under investigation.

Because multiple variants are measured together, this is a more cost-effective experimental technique.

This technique is currently being used to investigate the F9 gene, which codes for a protein called Coagulation Factor IX which is essential for blood clotting. Variants of F9 can cause hemophilia.

That brings us to the next section: Counting Things

Counting Things

Image: Ben Franske (BenFranske), CC BY-SA 2.5, via Wikimedia Commons

Central to all these methods is the need to count things.

This sounds really simple, and it is. Computers have been counting things since the first tabulating machines started processing punched cards.

The Problem

From the SGE paper I just mentioned:

Data sizes only increasing ...

Data at NCBI

Let's have a look at the numbers from the SGE paper I mentioned earlier.

It's only a few hundred million sequences, a few billon base pairs all up.

The Simple And Obvious Solution

sort sequences.txt | uniq -cd

There's tools for that, and they're probably already on your computer. Sequences go in, counts go out, problem solved.

The Actual Problem

Of course it isn't as simple as that. Sequencing data isn't generally in text format, it's in FASTQ or FASTA or SAM or CRAM or SRA format.

Often there's metadata hanging around in columns, or in filenames or as separate metadata files.

And, this being biology, everything is squishy, you'll probably need to filter out low quality or bad reads.

Then you'll need to collect those counts together in some way, so you can compare them, calculate some ratios, come up with some scores, that kind of thing.

And finally, before you get too excited, you'll want to run some statistical tools over the data ...

Image: tragically, unknown.

... because it's important to check if you really did discover something or it's just noise.

Spreadsheets


Source: Biomedcentral (emphasis mine)

A lot of analysis work used to be done in spreadsheets, but in recent years data sizes have grown faster than spreadsheets have kept up. Plus there's a few foot-guns to be found in Excel's handling of gene names such as SEPT2 and alphanumeric identifiers with a single E in them.

Programming Languages

A lot of ad-hoc work happens in languages like Python and R, with support from various others like Bash, Make and Snakemake.

A fair bit of this code is pretty flakey and hard to replicate, we wanted to take a different approach.

Enrich2

A tool called "Enrich2" was popular for some specific experiments, but it was implemented in Python 2, proved difficult to upgrade to Python 3 and was limited to a specific set of experimental designs.

CountESS — Science!

Image: Valve Software (Half Life)

This is problematic because as technology advances and new techniques are invented scientists will keep coming up with new kinds of experiment!

CountESS

Count-based Experiment Scoring and Statistics

So we've developed CountESS, which is short for "Count-based Experiment Scoring and Statistics".

CountESS

It has a graphical user interface featuring a preview of your data as it is manipulated. It's plugin based so anyone can contribute functionality. It has a flexible, data flow graph based architecture. And all the configuration is held in a single file, which is easy to organize.

CountESS — Architecture

GUI CLI
Pipeline Config Files
Plugins Plugins Plugins
Pandas / Numpy Other Libraries

This shows the general architecture, and now we'll step through each of these parts.

CountESS — GUI

GUI CLI
Pipeline Config Files
Plugins Plugins Plugins
Pandas / Numpy Other Libraries

The GUI code is kept quite separate from the rest. At the moment, it's implemented in TkInter, which seemed like an expedient choice since it comes bundled with Python.

CountESS — GUI — Overall

It's not all that pretty but works passably well over X11 forwarding, which is nice.

There's a row of buttons along the top to load and save configuration,

There's a plugin graph, which lets you select and rearrange plugins. The currently selected plugin is highlighted, and on the right is a configuration pane and a data preview pane showing the state of the selected plugin.

CountESS — GUI — Plugin Tree

The plugin tree lets you connect plugins together graphically. This lets you have multiple inputs to a plugin, and also multiple outputs from a plugin. So you can use the plugin graph to join sources of data and also to process data in multiple ways.

This is pretty similar to the shader graphs in Unity or the Fusion tools in Davinci Resolve or, for that matter, LabView in the 1990s, which I talked about back at https://nick.zoic.org/art/decoding-programming-beyond-text-files/

The thing I always hate about these graph views is that they get messy quickly and inspire a lot of fidgeting about. I've implemented a "tidy graph" button which rearranges the graph to line up strata and space nodes evenly while keeping their relative order.

It could be improved by adding in force-directed placement ala GraphViz, but it's already good enough that I'm going to add in a "tidy mode" where this is just done automatically any time you move or add anything.

CountESS — GUI — Configuration

Configuration pane lets you configure each plugin, like filling in a form

The configuration pane lets you set up the configuration of each plugin, like filling in a form.

CountESS — GUI — Data Previews

Preview pane shows current output of plugin run on 100k rows

The preview pane shows the output of the plugin. While using the GUI to edit configurations, about 100,000 rows of your data are processed, with live updates as the configuration changes.

CountESS — CLI

GUI CLI
Pipeline Config Files
Plugins Plugins Plugins
Pandas / Numpy Other Libraries
There's not too much to say about the CLI. The GUI loads and saves its configuration in a configuration file, and the CLI can read that file and run it, with diagnostics to standard output and standard error. This lets you set up your work in the GUI with live preview, and then run it possibly on a remote machine with no GUI available.

CountESS — Config Files

GUI CLI
Pipeline Config Files
Plugins Plugins Plugins
Pandas / Numpy Other Libraries

CountESS — Pipeline

GUI CLI
Pipeline Config Files
Plugins Plugins Plugins
Pandas / Numpy Other Libraries

CountESS — Plugins

GUI CLI
Pipeline Config Files
Plugins Plugins Plugins
Pandas / Numpy Other Libraries

Three kinds of plugins:

Loosely, there are three kinds of plugins: input, transformer and output. Input plugins read from files and produce dataframes, transformer plugins combine and transform dataframes, and output plugins write dataframes back to disk in one form or another.

CountESS — Plugins — 2

Plugins subclass one of the many countess.core.plugins.BasePlugin subclasses:

from countess.core.plugins import PandasTransformDictToSinglePlugin

class MyNewPlugin(PandasTransformDictToSinglePlugin):

    parameters = {
        "a": ColumnChoiceParam("Column A"),
        "c": IntegerParam("Favourite Number", 42),
    }

    def process_dict(self, data: dict, logger: Logger) -> Optional[int]:
        return data[self.parameters['a'].value] // self.parameters['c'].value

(a simplified example)

CountESS — Plugins — 4

Your plugin's pyproject.toml
[project.entry-points.countess_plugins]
myplugin = "mypluginpackage:MyNewPlugin"

CountESS:
entry_points = importlib.metadata.entry_points().select(
    group="countess_plugins"
)

for ep in entry_points:
    try:
        plugin_class = ep.load()
        assert issubclass(plugin_class, BasePlugin)
    except Exception as exc:
        logger.exception(exc)

(modified for dramatic effect)

CountESS uses the Python "importlib" mechanism to find plugins, both built-in and third party. So anyone can create, distribute and use CountESS plugins.

CountESS — Libraries

GUI CLI
Pipeline Config Files
Plugins Plugins Plugins
Pandas / Numpy Other Libraries
CountESS plugins are mostly built over Pandas and Numpy functions, and/or over other Python libraries such as MiniMap2. If your function is expressible in Pandas it's easy to turn into a CountESS plugin.

CountESS — Example — 1

Okay so a quick example of how CountESS can solve experimental problems. This is a setup much like the SGE experiment, using the same BRCA1 data from that experiment.

That data is in 163 FASTQ files, so the first thing we do is load those up. This plugin can also count up the number of repeats of the same sequence in a file, so we'll do that as well.

CountESS — Example — 2

Then we just extract the run number, `Run`, from the header field.

CountESS — Example — 3

There's another file in with the data files, a CSV file called `SRaRunTable.txt`. This tells you which experiment each of those Runs belongs to, and which phase of the experiment.

CountESS — Example — 4

For the moment we'll just extract the exon number out of that ...

CountESS — Example — 5

... and then we can join that back to the input file, using the Run number. Now we know which exon each sequence belongs to.

CountESS — Example — 6

We know what the reference sequence is for each of these exons, so we'll put that in another file ...

CountESS — Example — 7

... so we can join that in with the sequences.

CountESS — Example — 8

Okay so some of these sequences are unchanged. We call those the "wildtypes". We can filter just those one way and keep those counts ...

CountESS — Example — 9

... and the ones that have changed, we'll separate those off ...

CountESS — Example — 10

... and call each of the variants, comparing the reference sequence to the variant sequence, recording the HGVS variant string.

CountESS — Example — 11

Then we can join this back to the wildtype counts ...

CountESS — Example — 12

... calculate a score ...

CountESS — Example — 13

... and finally save that off to a CSV file.

CountESS

So hopefully we've seen ...

Questions


https://nick.zoic.org/PYCON23/