A lot of this will be pretty approximate too so don't sweat the details right now. If you're interested in the details let's catch up after this session over lunch.
Or for those of you watching online, you could do a lot worse than going down a Bioinformatics Wikipedia Rabbit Hole.
References:
There's a lot we still don't know about cell biology. The mechanisms by which all living things function and reproduce are complex and difficult to observe directly, since they are hidden inside living cells.
But we can construct experiments which let us observe the outward behaviour of cells and through analysis we can determine what's going on inside them. This is not totally unlike black-box testing of software: you might not be able to read the code, but if you can find a correlation between a change in input and a change in output, you can start to understand what the program does even if you can't directly watch it doing it.
The "program" of a cell is called its genome. The human genome is split into 23 chromosomes each of which is a DNA molecule, a chain of a hundred million or so little building blocks called base pairs or nucleotides. All in all, the human genome consists of about 12Gbit of information.
All of your cells have the same genome, like sharing a container image, but different genes are active in different cells, depending on their role in the body.
These days, we can actually read out an individual cell's entire genome using a technique called "nanopore sequencing", which draws the DNA molecule through a tiny hole like a small child slurping up spaghetti. And using techniques we've learned from studying bacteria and viruses, we can even edit DNA molecules and change or append base pairs. So we can read and write the binary, but that doesn't mean we understand the program.
We do know a bit about the syntax. Each chromosome contains many regions called genes, with each gene providing the instructions to make tiny blobby machines called proteins, which the cell uses to do its job. Each gene is wrapped up in more instructions which form a kind of header and footer and which regulate how and when the gene is used. Some genes are small, hundreds of base pairs long and producing a single protein, and some are very large, a couple of million base pairs long and encoding dozens of proteins.
There are no debug symbols in the genome, so when we think we know what a gene does, we give it an arbitrary name. This process is familiar to anyone who's reverse engineered a binary, either with paper and pencil or with modern reverse-engineering tools.
We know of about 20,000 genes which encode proteins, but those constitute about 3% of the entire genome. We're pretty sure at least 80% of the genome does something, so we've got a ways to go.
Reference: "Encode Project Writes Eulogy For Junk DNA", Science Magazine.
Every person's genome is different. We have the concept of a "reference genome" which describes what we think the "normal" sequence for every gene is, but no-one's actually got that exact genome.
The most common kind of variant is a single base (nucleotide) change. If it's a common difference, seen in more than 1% of the population, it's called a Polymorphism, whereas a rare change is generally just called a Variant. Either way it means the same thing: one base is changed from the reference.
Some differences in genes lead to obvious, observable differences in the organism, such as ability to digest lactose as an adult, or hair colour. These tend to have a large variation in the population because they don't matter very much, or at least not in our current lives.
Because there's a big population to work with, it's possible to find a correlation between lactose intolerance and specific variants of genes LCT and MCM6.
References:
... or to determine the likely hair colour of an ancient human from the DNA in her 5,700-year-old chewing gum.
Reference: Human Genome Recovered From 5,700-Year-Old Chewing Gum
As another example, recent work published on medRxiv suggests a link between specific variants of FOXP4 and severity of COVID infection, as well as susceptibility to Long COVID.
See also: NPR Goats and Soda
Other gene variations lead to more subtle differences, such as your likelihood of developing specific cancers. Cancers are caused by damage to DNA, perhaps due to chemicals or radiation or just bad luck. Most such damage triggers cellular defense mechanisms, which repair the DNA, or destroy the cell rather than allow it to become a tumor.
But we know clinically that some gene variants reduce the effectiveness of these defense mechanisms, greatly increasing the risk of developing specific cancers.
Because these pathogenic variants are rare, and people's lives are complicated, it's much harder to find them in among the background noise of benign variants and unrelated risk factors.
To put it in programming terms, we're looking for a bug in an exception handler in some critical legacy code which is live and exposed to the Internet.
Multiplex Assays of Variant Effect ... wait, wrong slide ...
Multiplex Assays of Variant Effect ... right, so there's a few thousand known variants of these genes we've learned about from clinical cases.
But there's a lot of possible variants we don't know anything about, because we've never observed them clinically. Some of those are very likely harmless, and others might be very serious: this information can inform individual choices for the prevention or treatment of cancer.
Multiplex Assays of Variant Effect (MAVEs) are an attempt to catalog not just the gene variants we've seen clinically but *all possible variants* of a gene, classifying their effects as more or less concerning, by constructing experiments which measure the variant effect.
The outcome is the same as with clinical observation: variants classified by harmfulness. The two approaches complement each other.
As part of Friday's Django miniconf, my colleague Alan Rubin spoke about MaveDB, an online repository for results from MAVE experiments, so if you missed that one, check it out in the conference videos.
I'm going to briefly introduce a couple of MAVE experimental techniques: Saturation Genome Editing and VAMP-seq, because they illustrate the problem we're trying to solve.
This paper uses Saturation Genome Editing to analyze almost 4000 single-base variants of BRCA1. For each variant, the effect on cell growth is assessed by counting the variants in the cell population after 11 days. The results show good correlation with the variants we already know of clinically, and also identify new pathogenic variants. Similar work is underway to analyse other tumor-suppressor genes such as BRCA2 and PALB2.
This chart shows the distribution of scores for different single nucleotide variants of BRCA1. The largest peak is around a score of 0, indicating that a lot of variants are pretty much harmless, but the other peak shows the variants which cause "loss of function", and thus increased risk of cancer developing.
The VAMP-seq technique produces cells expressing each possible gene variant, like SGE, but also adds the sequence for a fluorescent protein called EGFP, derived from a jellyfish protein.
This protein is useful for all kinds of experiments and is kind of the equivalent of an embedded software developer adding in a couple of diagnostic LEDs.
The more of the variant protein that accumulates in the cell, the more EGFP is present, and the more the cell fluoresces. Cells are then "binned": mechanically sorted by how much they glow.
Each bin is then sequenced, and the population of each variant in each bin is counted.
This is then used to "score" each variant on how much protein is present.
... and a heat map can be made showing the effect of changing each amino acid at each position in the protein under investigation.
Because multiple variants are measured together, this is a more cost-effective experimental technique.
This technique is currently being used to investigate the F9 gene, which codes for a protein called Coagulation Factor IX which is essential for blood clotting. Variants of F9 can cause hemophilia.
That brings us to the next section: Counting Things
Central to all these methods is the need to count things.
This sounds really simple, and it is. Computers have been counting things since the first tabulating machines started processing punched cards.
Let's have a look at the numbers from the SGE paper I mentioned earlier.
It's only a few hundred million sequences, a few billion base pairs all up.
There's tools for that, and they're probably already on your computer. Sequences go in, counts go out, problem solved.
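Conceptually it's just `sort | uniq -c`. Here's a minimal Python equivalent, with toy sequences standing in for real data:

```python
from collections import Counter

# Toy data: the real inputs are hundreds of millions of sequences.
sequences = ["ACGT", "ACGT", "ACGA", "ACGT", "ACGA"]
counts = Counter(sequences)
print(counts.most_common())  # [('ACGT', 3), ('ACGA', 2)]
```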
Of course it isn't as simple as that. Sequencing data isn't generally in plain text format; it's in FASTQ or FASTA or SAM or CRAM or SRA format.
Often there's metadata hanging around in columns, or in filenames or as separate metadata files.
And, this being biology, everything is squishy, you'll probably need to filter out low quality or bad reads.
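To make those steps concrete, here's a toy FASTQ reader with a crude quality filter. A real pipeline would use an established library like Biopython or pysam, and the filename and cutoff here are made up:

```python
import gzip
from collections import Counter

def read_fastq(path):
    """Yield (header, sequence, quality) records from a gzipped FASTQ file.
    FASTQ is four lines per record: header, sequence, '+', quality string."""
    with gzip.open(path, "rt") as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                return
            sequence = fh.readline().rstrip()
            fh.readline()  # the '+' separator line
            quality = fh.readline().rstrip()
            yield header, sequence, quality

def mean_quality(quality):
    """Mean Phred score, assuming the standard Phred+33 encoding."""
    return sum(ord(c) - 33 for c in quality) / len(quality)

# Count sequences, skipping reads below an arbitrary quality cutoff.
counts = Counter(
    sequence
    for _, sequence, quality in read_fastq("example.fastq.gz")  # made-up filename
    if mean_quality(quality) >= 25
)
```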
Then you'll need to collect those counts together in some way, so you can compare them, calculate some ratios, come up with some scores, that kind of thing.
And finally, before you get too excited, you'll want to run some statistical tools over the data ...
... because it's important to check if you really did discover something or it's just noise.
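One of the simplest sanity checks is whether independent replicates agree with each other. A toy version with made-up counts (`statistics.correlation` needs Python 3.10 or later):

```python
from statistics import correlation  # Python 3.10+

rep1 = [120, 45, 301, 12, 87]  # made-up variant counts, replicate 1
rep2 = [131, 40, 290, 15, 95]  # made-up variant counts, replicate 2
print(f"r = {correlation(rep1, rep2):.3f}")  # near 1.0 means the replicates agree
```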
A lot of analysis work used to be done in spreadsheets, but in recent years data sizes have grown faster than spreadsheets have kept up. Plus there's a few foot-guns to be found in Excel's handling of gene names such as SEPT2 (which it silently turns into a date) and alphanumeric identifiers with a single E in them (which it reads as scientific notation).
A lot of ad-hoc work happens in languages like Python and R, with support from various others like Bash, Make and Snakemake.
A fair bit of this code is pretty flaky and hard to replicate, so we wanted to take a different approach.
A tool called "Enrich2" was popular for some specific experiments, but it was implemented in Python 2, proved difficult to upgrade to Python 3 and was limited to a specific set of experimental designs.
This is problematic because as technology advances and new techniques are invented scientists will keep coming up with new kinds of experiment!
So we've developed CountESS, which is short for "Count-based Experiment Scoring and Statistics".
It has a graphical user interface featuring a preview of your data as it is manipulated. It's plugin based so anyone can contribute functionality. It has a flexible, data flow graph based architecture. And all the configuration is held in a single file, which is easy to organize.
This shows the general architecture, and now we'll step through each of these parts.
The GUI code is kept quite separate from the rest. At the moment, it's implemented in Tkinter, which seemed like an expedient choice since it comes bundled with Python.
It's not all that pretty but works passably well over X11 forwarding, which is nice.
There's a row of buttons along the top to load and save configuration.
There's a plugin graph, which lets you select and rearrange plugins. The currently selected plugin is highlighted, and on the right is a configuration pane and a data preview pane showing the state of the selected plugin.
The plugin tree lets you connect plugins together graphically. This lets you have multiple inputs to a plugin, and also multiple outputs from a plugin. So you can use the plugin graph to join sources of data and also to process data in multiple ways.
This is pretty similar to the shader graphs in Unity or the Fusion tools in Davinci Resolve or, for that matter, LabView in the 1990s, which I talked about back at https://nick.zoic.org/art/decoding-programming-beyond-text-files/
The thing I always hate about these graph views is that they get messy quickly and inspire a lot of fidgeting about. I've implemented a "tidy graph" button which rearranges the graph to line up strata and space nodes evenly while keeping their relative order.
It could be improved by adding in force-directed placement à la GraphViz, but it's already good enough that I'm going to add in a "tidy mode" where this is just done automatically any time you move or add anything.
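The layering idea behind the tidy button is roughly this. A toy sketch, not the actual CountESS code:

```python
def tidy(nodes, edges):
    """Lay out a DAG: nodes is a list of ids in current display order,
    edges is a list of (src, dst) pairs."""
    # Each node's stratum (column) is its longest path from any input node.
    stratum = {n: 0 for n in nodes}
    for _ in nodes:  # relax repeatedly; enough passes for any DAG this size
        for src, dst in edges:
            stratum[dst] = max(stratum[dst], stratum[src] + 1)
    # Space nodes evenly within each stratum, keeping their relative order.
    positions = {}
    for s in set(stratum.values()):
        column = [n for n in nodes if stratum[n] == s]
        for i, n in enumerate(column):
            positions[n] = (s, (i + 1) / (len(column) + 1))
    return positions

print(tidy(["load", "trim", "count", "save"],
           [("load", "trim"), ("trim", "count"), ("count", "save")]))
```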
The configuration pane lets you set up the configuration of each plugin, like filling in a form.
The preview pane shows the output of the plugin. While using the GUI to edit configurations, about 100,000 rows of your data are processed, with live updates as the configuration changes.
Loosely, there are three kinds of plugins: input, transformer and output. Input plugins read from files and produce dataframes, transformer plugins combine and transform dataframes, and output plugins write dataframes back to disk in one form or another.
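In spirit, the three kinds look something like this. This is a conceptual sketch, not CountESS's actual plugin API:

```python
import pandas as pd

class InputPlugin:
    """Reads a file and produces a dataframe."""
    def load(self, path: str) -> pd.DataFrame:
        raise NotImplementedError

class TransformerPlugin:
    """Combines and transforms dataframes."""
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

class OutputPlugin:
    """Writes a dataframe back to disk."""
    def save(self, df: pd.DataFrame, path: str) -> None:
        raise NotImplementedError
```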
CountESS uses the Python "importlib" mechanism to find plugins, both built-in and third party. So anyone can create, distribute and use CountESS plugins.
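Third-party discovery with `importlib.metadata` typically looks something like the sketch below; the entry-point group name is illustrative, not necessarily the one CountESS uses:

```python
from importlib.metadata import entry_points  # keyword API needs Python 3.10+

# Any installed package that registers an entry point in this group
# ("countess.plugins" is a hypothetical name) becomes available as a plugin,
# with no changes needed to CountESS itself.
for ep in entry_points(group="countess.plugins"):
    plugin_class = ep.load()
    print(f"found plugin {ep.name}: {plugin_class}")
```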
Okay so a quick example of how CountESS can solve experimental problems. This is a setup much like the SGE experiment, using the same BRCA1 data from that experiment.
That data is in 163 FASTQ files, so the first thing we do is load those up. This plugin can also count up the number of repeats of the same sequence in a file, so we'll do that as well.
Then we just extract the run number, `Run`, from the header field.
There's another file in with the data files, a CSV file called `SraRunTable.txt`. This tells you which experiment each of those Runs belongs to, and which phase of the experiment.
For the moment we'll just extract the exon number out of that ...
... and then we can join that back to the input file, using the Run number. Now we know which exon each sequence belongs to.
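In pandas terms (just an illustration, with made-up data standing in for the real files), that join looks like:

```python
import pandas as pd

counts = pd.DataFrame({
    "sequence": ["ACGTT", "ACGAT", "TCGTT"],  # made-up reads
    "count":    [1523, 12, 407],
    "Run":      ["SRR001", "SRR001", "SRR002"],
})
run_table = pd.DataFrame({
    "Run":  ["SRR001", "SRR002"],
    "exon": [2, 3],
})
merged = counts.merge(run_table, on="Run")  # now every sequence knows its exon
```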
We know what the reference sequence is for each of these exons, so we'll put that in another file ...
... so we can join that in with the sequences.
Okay so some of these sequences are unchanged. We call those the "wildtypes". We can filter just those one way and keep those counts ...
... and the ones that have changed, we'll separate those off ...
... and call each of the variants, comparing the reference sequence to the variant sequence, recording the HGVS variant string.
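For the simplest case, same-length sequences with single-base substitutions, variant calling can be sketched like this. Real variant calling, with insertions, deletions, alignment and proper HGVS coordinates, is considerably hairier:

```python
def call_snvs(reference: str, variant: str) -> list[str]:
    """Report single-base substitutions in a simplified HGVS-like style."""
    return [
        f"{position + 1}{ref_base}>{var_base}"
        for position, (ref_base, var_base) in enumerate(zip(reference, variant))
        if ref_base != var_base
    ]

print(call_snvs("ACGTACGT", "ACGAACGT"))  # ['4T>A']
```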
Then we can join this back to the wildtype counts ...
... calculate a score ...
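One common pattern is a log ratio of variant counts against wildtype; the actual scoring in these experiments is more sophisticated, so treat this as a sketch:

```python
import math

def score(variant_count: int, wildtype_count: int, pseudocount: float = 0.5) -> float:
    """Log2 enrichment of a variant relative to wildtype; the pseudocount
    avoids log(0) for variants that drop out of the population entirely."""
    return math.log2((variant_count + pseudocount) / (wildtype_count + pseudocount))

print(round(score(40, 1000), 2))  # strongly negative: this variant is depleted
```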
... and finally save that off to a CSV file.