A lot of this will be pretty approximate too so don't sweat the details right now. If you're interested in the details let's catch up after this session over lunch.
Or for those of you watching online, you could do a lot worse than going down a Bioinformatics Wikipedia Rabbit Hole.
References:
There's a lot we still don't know about cell biology. The mechanisms by which all living things function and reproduce are complex and difficult to observe directly, since they are hidden inside living cells.
But we can construct experiments which let us observe the outward behaviour of cells and through analysis we can determine what's going on inside them. This is not totally unlike black-box testing of software: you might not be able to read the code, but if you can find a correlation between a change in input and a change in output, you can start to understand what the program does even if you can't directly watch it doing it.
The "program" of a cell is called its genome. The human genome is split into 23 chromosomes each of which is a DNA molecule, a chain of a hundred million or so little building blocks called base pairs or nucleotides. All in all, the human genome consists of about 12Gbit of information.
All of your cells have the same genome, like sharing a container image, but different genes are active in different cells, depending on their role in the body.
These days, we can actually read out an individual cell's entire genome using a technique called "nanopore sequencing", which draws the DNA molecule through a tiny hole like a small child slurping up spaghetti. And using techniques we've learned from studying bacteria and viruses, we can even edit DNA molecules and change or append base pairs. So we can read and write the binary, but that doesn't mean we understand the program.
We do know a bit about the syntax. Each chromosome contains many regions called genes, with each gene providing the instructions to make tiny blobby machines called proteins, which the cell uses to do its job. Each gene is wrapped up in more instructions which form a kind of header and footer and which regulate how and when the gene is used. Some genes are small, hundreds of base pairs long and producing a single protein, and some are very large, a couple of million base pairs long and encoding dozens of proteins.
There are no debug symbols in the genome, so when we think we know what a gene does, we give it an arbitrary name. This process is familiar to anyone who's reverse engineered a binary, either with paper and pencil or with modern reverse-engineering tools.
We know of about 20,000 genes which encode proteins, but those constitute about 3% of the entire genome. We're pretty sure at least 80% of the genome does something, so we've got a ways to go.
Reference: "Encode Project Writes Eulogy For Junk DNA", Science Magazine.
Every person's genome is different. We have the concept of a "reference genome" which describes what we think the "normal" sequence for every gene is, but no-one's actually got that exact genome.
The most common kind of variant is a single base (nucleotide) change. If it's a common difference, seen in more than 1% of the population, it's called a Polymorphism, whereas a rare change is generally just called a Variant. Either way it means the same thing: one base is changed from the reference.
Some differences in genes lead to obvious, observable differences in the organism, such as ability to digest lactose as an adult, or hair colour. These tend to have a large variation in the population because they don't matter very much, or at least not in our current lives.
Because there's a big population to work with, it's possible to find a correlation between lactose intolerance and specific variants of genes LCT and MCM6.
References:
... or to determine the likely hair colour of an ancient human from the DNA in her 5,700-year-old chewing gum.
Reference: Human Genome Recovered From 5,700-Year-Old Chewing Gum
As another example, recent work published on medRxiv suggests a link between specific variants of FOXP4 and severity of COVID infection, as well as susceptibility to Long COVID.
See also: NPR Goats and Soda
Other gene variations lead to more subtle differences, such as your likelihood of developing specific cancers. Cancers are caused by damage to DNA, perhaps due to chemicals or radiation or just bad luck. Most such damage triggers cellular defense mechanisms, which repair the DNA, or destroy the cell rather than allow it to become a tumor.
But we know clinically that some gene variants reduce the effectiveness of these defense mechanisms, greatly increasing the risk of developing specific cancers.
Because these pathogenic variants are rare, and people's lives are complicated, it's much harder to find them in among the background noise of benign variants and unrelated risk factors.
To put it in programming terms, we're looking for a bug in an exception handler in some critical legacy code which is live and exposed to the Internet.
Multiplex Assays of Variant Effect ... wait, wrong slide ...
Multiplex Assays of Variant Effect ... right, so there's a few thousand known variants of these genes we've learned about from clinical cases.
But there's a lot of possible variants we don't know anything about, because we've never observed them clinically. Some of those are very likely harmless, and others might be very serious: this information can inform individual choices for the prevention or treatment of cancer.
Multiplex Assays of Variant Effect (MAVEs) are an attempt to catalog not just the gene variants we've seen clinically but *all possible variants* of a gene, classifying their effects as more or less concerning, by constructing experiments which measure the variant effect.
The outcome is the same as with clinical observation: variants classified by harmfulness. The two approaches complement each other.
As part of Friday's Django miniconf, my colleague Alan Rubin spoke about MaveDB, an online repository for results from MAVE experiments, so if you missed that one, check it out in the conference videos.
I'm going to briefly introduce a couple of MAVE experimental techniques: Saturation Genome Editing and VAMP-seq, because they illustrate the problem we're trying to solve.
This paper uses Saturation Genome Editing to analyze almost 4000 single-base variants of BRCA1. For each variant, the effect on cell growth is assessed by counting the variants in the cell population after 11 days. The results show good correlation with the variants we already know of clinically, and also identify new pathogenic variants. Similar work is underway to analyse other tumor-suppressor genes such as BRCA2 and PALB2.
This chart shows the distribution of scores for different single nucleotide variants of BRCA1. The largest peak is around a score of 0, indicating that a lot of variants are pretty much harmless, but the other peak shows the variants which cause "loss of function", and thus increased risk of cancer developing.
The VAMP-seq technique produces cells expressing each possible gene variant, like SGE, but also adds the sequence for a fluorescent protein called EGFP, derived from a jellyfish protein.
This protein is useful for all kinds of experiments and is kind of the equivalent of an embedded software developer adding in a couple of diagnostic LEDs.
The more of the variant protein that accumulates in the cell, the more EGFP is present, and the more the cell fluoresces. Cells are then "binned": mechanically sorted by how much they glow.
Each bin is then sequenced, and the population of each variant in each bin is counted.
This is then used to "score" each variant on how much protein is present.
... and a heat map can be made showing the effect of changing each amino acid at each position in the protein under investigation.
Because multiple variants are measured together, this is a more cost-effective experimental technique.
This technique is currently being used to investigate the F9 gene, which codes for a protein called Coagulation Factor IX which is essential for blood clotting. Variants of F9 can cause hemophilia.
That brings us to the next section: Counting Things
Central to all these methods is the need to count things.
This sounds really simple, and it is. Computers have been counting things since the first tabulating machines started processing punched cards.
Let's have a look at the numbers from the SGE paper I mentioned earlier.
It's only a few hundred million sequences, a few billion base pairs all up.
There's tools for that, and they're probably already on your computer. Sequences go in, counts go out, problem solved.
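Conceptually it's just `sort | uniq -c`. Here's a minimal Python equivalent, with toy sequences standing in for real data:

```python
from collections import Counter

# Toy data: the real inputs are hundreds of millions of sequences.
sequences = ["ACGT", "ACGT", "ACGA", "ACGT", "ACGA"]
counts = Counter(sequences)
print(counts.most_common())  # [('ACGT', 3), ('ACGA', 2)]
```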
Of course it isn't as simple as that. Sequencing data isn't generally in plain text format; it's in FASTQ or FASTA or SAM or CRAM or SRA format.
Often there's metadata hanging around in columns, or in filenames or as separate metadata files.
And, this being biology, everything is squishy, you'll probably need to filter out low quality or bad reads.
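To make those steps concrete, here's a toy FASTQ reader with a crude quality filter. A real pipeline would use an established library like Biopython or pysam, and the filename and cutoff here are made up:

```python
import gzip
from collections import Counter

def read_fastq(path):
    """Yield (header, sequence, quality) records from a gzipped FASTQ file.
    FASTQ is four lines per record: header, sequence, '+', quality string."""
    with gzip.open(path, "rt") as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                return
            sequence = fh.readline().rstrip()
            fh.readline()  # the '+' separator line
            quality = fh.readline().rstrip()
            yield header, sequence, quality

def mean_quality(quality):
    """Mean Phred score, assuming the standard Phred+33 encoding."""
    return sum(ord(c) - 33 for c in quality) / len(quality)

# Count sequences, skipping reads below an arbitrary quality cutoff.
counts = Counter(
    sequence
    for _, sequence, quality in read_fastq("example.fastq.gz")  # made-up filename
    if mean_quality(quality) >= 25
)
```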
Then you'll need to collect those counts together in some way, so you can compare them, calculate some ratios, come up with some scores, that kind of thing.
And finally, before you get too excited, you'll want to run some statistical tools over the data ...
... because it's important to check if you really did discover something or it's just noise.
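One of the simplest sanity checks is whether independent replicates agree with each other. A toy version with made-up counts (`statistics.correlation` needs Python 3.10 or later):

```python
from statistics import correlation  # Python 3.10+

rep1 = [120, 45, 301, 12, 87]  # made-up variant counts, replicate 1
rep2 = [131, 40, 290, 15, 95]  # made-up variant counts, replicate 2
print(f"r = {correlation(rep1, rep2):.3f}")  # near 1.0 means the replicates agree
```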
A lot of analysis work used to be done in spreadsheets, but in recent years data sizes have grown faster than spreadsheets have kept up. Plus there's a few foot-guns to be found in Excel's handling of gene names such as SEPT2 (which it silently turns into a date) and alphanumeric identifiers with a single E in them (which it reads as scientific notation).
A lot of ad-hoc work happens in languages like Python and R, with support from various others like Bash, Make and Snakemake.
A fair bit of this code is pretty flaky and hard to replicate, so we wanted to take a different approach.
A tool called "Enrich2" was popular for some specific experiments, but it was implemented in Python 2, proved difficult to upgrade to Python 3 and was limited to a specific set of experimental designs.
This is problematic because as technology advances and new techniques are invented scientists will keep coming up with new kinds of experiment!
So we've developed CountESS, which is short for "Count-based Experiment Scoring and Statistics".
It has a graphical user interface featuring a preview of your data as it is manipulated. It's plugin based so anyone can contribute functionality. It has a flexible, data flow graph based architecture. And all the configuration is held in a single file, which is easy to organize.
This shows the general architecture, and now we'll step through each of these parts.
The GUI code is kept quite separate from the rest. At the moment, it's implemented in Tkinter, which seemed like an expedient choice since it comes bundled with Python.
It's not all that pretty but works passably well over X11 forwarding, which is nice.
There's a row of buttons along the top to load and save configuration.
There's a plugin graph, which lets you select and rearrange plugins. The currently selected plugin is highlighted, and on the right is a configuration pane and a data preview pane showing the state of the selected plugin.
The plugin tree lets you connect plugins together graphically. This lets you have multiple inputs to a plugin, and also multiple outputs from a plugin. So you can use the plugin graph to join sources of data and also to process data in multiple ways.
This is pretty similar to the shader graphs in Unity or the Fusion tools in Davinci Resolve or, for that matter, LabView in the 1990s, which I talked about back at https://nick.zoic.org/art/decoding-programming-beyond-text-files/
The thing I always hate about these graph views is that they get messy quickly and inspire a lot of fidgeting about. I've implemented a "tidy graph" button which rearranges the graph to line up strata and space nodes evenly while keeping their relative order.
It could be improved by adding in force-directed placement à la GraphViz, but it's already good enough that I'm going to add in a "tidy mode" where this is just done automatically any time you move or add anything.
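The layering idea behind the tidy button is roughly this. A toy sketch, not the actual CountESS code:

```python
def tidy(nodes, edges):
    """Lay out a DAG: nodes is a list of ids in current display order,
    edges is a list of (src, dst) pairs."""
    # Each node's stratum (column) is its longest path from any input node.
    stratum = {n: 0 for n in nodes}
    for _ in nodes:  # relax repeatedly; enough passes for any DAG this size
        for src, dst in edges:
            stratum[dst] = max(stratum[dst], stratum[src] + 1)
    # Space nodes evenly within each stratum, keeping their relative order.
    positions = {}
    for s in set(stratum.values()):
        column = [n for n in nodes if stratum[n] == s]
        for i, n in enumerate(column):
            positions[n] = (s, (i + 1) / (len(column) + 1))
    return positions

print(tidy(["load", "trim", "count", "save"],
           [("load", "trim"), ("trim", "count"), ("count", "save")]))
```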
The configuration pane lets you set up the configuration of each plugin, like filling in a form.
The preview pane shows the output of the plugin. While using the GUI to edit configurations, about 100,000 rows of your data are processed, with live updates as the configuration changes.
Loosely, there are three kinds of plugins: input, transformer and output. Input plugins read from files and produce dataframes, transformer plugins combine and transform dataframes, and output plugins write dataframes back to disk in one form or another.
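In spirit, the three kinds look something like this. This is a conceptual sketch, not CountESS's actual plugin API:

```python
import pandas as pd

class InputPlugin:
    """Reads a file and produces a dataframe."""
    def load(self, path: str) -> pd.DataFrame:
        raise NotImplementedError

class TransformerPlugin:
    """Combines and transforms dataframes."""
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

class OutputPlugin:
    """Writes a dataframe back to disk."""
    def save(self, df: pd.DataFrame, path: str) -> None:
        raise NotImplementedError
```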
CountESS uses the Python "importlib" mechanism to find plugins, both built-in and third party. So anyone can create, distribute and use CountESS plugins.
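Third-party discovery with `importlib.metadata` typically looks something like the sketch below; the entry-point group name is illustrative, not necessarily the one CountESS uses:

```python
from importlib.metadata import entry_points  # keyword API needs Python 3.10+

# Any installed package that registers an entry point in this group
# ("countess.plugins" is a hypothetical name) becomes available as a plugin,
# with no changes needed to CountESS itself.
for ep in entry_points(group="countess.plugins"):
    plugin_class = ep.load()
    print(f"found plugin {ep.name}: {plugin_class}")
```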
Okay so a quick example of how CountESS can solve experimental problems. This is a setup much like the SGE experiment, using the same BRCA1 data from that experiment.
That data is in 163 FASTQ files, so the first thing we do is load those up. This plugin can also count up the number of repeats of the same sequence in a file, so we'll do that as well.
Then we just extract the run number, `Run`, from the header field.
There's another file in with the data files, a CSV file called `SraRunTable.txt`. This tells you which experiment each of those Runs belongs to, and which phase of the experiment.
For the moment we'll just extract the exon number out of that ...
... and then we can join that back to the input file, using the Run number. Now we know which exon each sequence belongs to.
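In pandas terms (just an illustration, with made-up data standing in for the real files), that join looks like:

```python
import pandas as pd

counts = pd.DataFrame({
    "sequence": ["ACGTT", "ACGAT", "TCGTT"],  # made-up reads
    "count":    [1523, 12, 407],
    "Run":      ["SRR001", "SRR001", "SRR002"],
})
run_table = pd.DataFrame({
    "Run":  ["SRR001", "SRR002"],
    "exon": [2, 3],
})
merged = counts.merge(run_table, on="Run")  # now every sequence knows its exon
```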
We know what the reference sequence is for each of these exons, so we'll put that in another file ...
... so we can join that in with the sequences.
Okay so some of these sequences are unchanged. We call those the "wildtypes". We can filter just those one way and keep those counts ...
... and the ones that have changed, we'll separate those off ...
... and call each of the variants, comparing the reference sequence to the variant sequence, recording the HGVS variant string.
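For the simplest case, same-length sequences with single-base substitutions, variant calling can be sketched like this. Real variant calling, with insertions, deletions, alignment and proper HGVS coordinates, is considerably hairier:

```python
def call_snvs(reference: str, variant: str) -> list[str]:
    """Report single-base substitutions in a simplified HGVS-like style."""
    return [
        f"{position + 1}{ref_base}>{var_base}"
        for position, (ref_base, var_base) in enumerate(zip(reference, variant))
        if ref_base != var_base
    ]

print(call_snvs("ACGTACGT", "ACGAACGT"))  # ['4T>A']
```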
Then we can join this back to the wildtype counts ...
... calculate a score ...
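One common pattern is a log ratio of variant counts against wildtype; the actual scoring in these experiments is more sophisticated, so treat this as a sketch:

```python
import math

def score(variant_count: int, wildtype_count: int, pseudocount: float = 0.5) -> float:
    """Log2 enrichment of a variant relative to wildtype; the pseudocount
    avoids log(0) for variants that drop out of the population entirely."""
    return math.log2((variant_count + pseudocount) / (wildtype_count + pseudocount))

print(round(score(40, 1000), 2))  # strongly negative: this variant is depleted
```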
... and finally save that off to a CSV file.