
The Primordial Code

You are running on an operating system a couple of billion years old which is full of primordial libraries, monkey patches, self-modifying code, viral hacks and even containers running a different operating system.

Content Warnings
Mention of infectious and genetic diseases,
including cancer, but nothing gruesome.

Background - Me


Nick Moore
nick.zoic.org

Consultant: software development, systems architecture.

mnemote.com

Photo: Charlotte Moore

My name's Nick Moore, I'm a consultant, working in software development and systems architecture. I've been working with computers for a very long time and Python for a fair proportion of that.

Previously



Bioinformatician?

I've previously presented at PyConAU and related conferences on MicroPython, Internet of Things, Visual Programming, Functional Programming in Javascript, NoSQL in Postgres.

Recently I've been working in bioinformatics. Bioinformatics is the study of biological systems using numerical analysis. Biological systems are pretty complex, so typically this analysis requires computers. I've been lucky enough to get involved in that part.

Thanks & Apologies

Thanks to my colleagues at
WEHI (Walter and Eliza Hall Institute of Medical Research),
University of Washington Genome Sciences
and the Brotman Baty Institute.

           

Mistakes and oversimplifications are all mine!

Thanks to my colleagues at Walter and Eliza Hall Institute of Medical Research and at University of Washington Genome Sciences and at the Brotman Baty Institute for their patience while I've found my feet in this field.

Mistakes and oversimplifications are all mine!

Slides, Notes, Errata


https://nick.zoic.org/PYCON25/

Bioinformatics — 1

1833  Charles Babbage's Analytical Engine (not completed)
1842  Punched tapes for telegraphy
1858  Charles Darwin publishes his theory of natural selection
1866  Gregor Mendel identifies "discrete inheritable units"
1869  DNA isolated (an enormously long molecule, purpose unknown)
1870  T. H. Huxley "Biogenesis and Abiogenesis"

Bioinformatics is the numerical study of biological systems. Since biological systems tend to be pretty complicated, this involves computers.

Now you've got two problems.

A lot of the foundations of both fields were laid in the 19th century. Gregor Mendel worked out that inheritance works through discrete inheritable units; he didn't use the word 'gene', but that's the idea. While Charles Babbage and Ada Lovelace were working on the idea that machines could perform calculations, Charles Darwin and Thomas Huxley were working on natural selection and how life could possibly have gotten started.

In the meantime, the DNA molecule got discovered, an enormously long molecule with no known purpose. We'll come back to that shortly.

Bioinformatics — 2

1913  Morgan & Sturtevant "genetic linkage maps" show linear structure of genetics.
1927  Nikolai Koltsov proposes "giant hereditary molecule" (but runs afoul of his government)
1936  Alan Turing's "universal computing machines" which use an (infinite) tape.
1937  Claude Shannon & foundation of digital computing.
1940  Bletchley Park "Bombe" for cryptanalysis
1940s  Frederick Sanger: Protein Sequencing
1952  Alan Turing works on morphogenesis (but runs afoul of his government)

During the 20th century, researchers worked out that some genes were more closely coupled than others, and this indicated a linear layout of genes in some kind of linked structure.

Nikolai Koltsov proposed a giant molecule as the store of genetic information, but this didn't fit well with the Lysenkoist theories of Soviet Russia, landing him in hot water with his government and leading to his early death.

In the meantime, Alan Turing and Claude Shannon worked out the theoretical and practical bases for computing, which led to machine-assisted cryptanalysis and the cracking of the Nazi Enigma codes. After the war, Turing turned his attention to morphogenesis, the way organisms grow complex structures, but he also attracted government attention and met an untimely death.

Proteins were known to be composed of long strings of amino acids, and were thought to have something to do with inheritance. Frederick Sanger earned a Nobel Prize for working out how to read these sequences.

Bioinformatics — 3

1944  DNA as the carrier of genetic information
1953  Watson, Crick and Franklin discover the structure of DNA
1970s  Frederick Sanger: DNA sequencing
2005  DNA editing with CRISPR
2015  DNA therapy
2019  in-body gene editing

However, it is soon discovered that proteins are a later step in the process, and the true carrier of genetic information is DNA.

Watson, Crick and Rosalind Franklin work out the structure of DNA and, unperturbed, Frederick Sanger works out how to read DNA sequences, earning another Nobel Prize.

In the 2000s, we work out how to edit DNA, eventually culminating in editing DNA in living beings to cure genetic diseases.

4,000,000,000 years of Biology in 10 minutes

“Many were increasingly of the opinion that they’d all made a big mistake in coming down from the trees in the first place.
And some said that even the trees had been a bad move, and that no one should ever have left the oceans.”


― Douglas Adams, The Hitchhiker's Guide to the Galaxy

Let's just quickly review the last 4 billion years of biology.

Cell

Images: prokaryote: Ali Zifan, CC BY-SA 4.0, via Wikimedia Commons
animal & plant: LadyofHats, Public Domain, via Wikimedia Commons.

The basic unit of life is the cell.

A cell is a bit like a tiny computer.

Inside there's a bunch of programs called genes which tell the cell how to make proteins, which are the working machinery of the cell.

The cell is bounded by a membrane which keeps the insides in and the outsides out. The cell interacts with the world through channels, which are kind of like ports: they selectively let specific molecules in and out of the cell.

Bacterial, animal and plant cells are a bit different, but they have these features in common.

DNA

Image: Zephyris, CC BY-SA 3.0, via Wikimedia Commons

Genes, the programs of the cell, are stored on *chromosomes*: very long *DNA* molecules, not unlike a tape. DNA is built up out of four "bases", Adenine, Cytosine, Guanine and Thymine, generally just abbreviated as A, C, G and T.

"base pairs"

| base | complement |
|------|------------|
| A    | T          |
| C    | G          |
| G    | C          |
| T    | A          |

4 Mbp = 8 Mbit = 1 MB

They zip together in pairs: A complements T and C complements G, so we generally refer to this smallest piece of genetic information as a "base pair". This is the unit we'll use to compare genome sizes. Because there are four possible pairs, each base pair is equivalent to 2 binary bits of information, so 1 million base pairs means 2 megabits of information.
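The arithmetic on the slide can be checked with a few lines of Python. This is only a sketch: the names `complement`, `pack` and `megabytes` are made up for illustration, and the particular 2-bit code assigned to each base is an arbitrary assumption (any fixed assignment would do).

```python
# Complement pairs from the slide: A<->T, C<->G.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Arbitrary 2-bit code per base (an assumption; any fixed order works).
BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def complement(seq: str) -> str:
    """Return the base that pairs with each base of seq."""
    return "".join(COMPLEMENT[base] for base in seq)


def pack(seq: str) -> bytes:
    """Pack a sequence at 2 bits per base, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].ljust(4, "A")  # pad the final byte with A
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BITS[base]
        out.append(byte)
    return bytes(out)


def megabytes(base_pairs: int) -> float:
    """Information content in megabytes (10**6 bytes) at 2 bits per pair."""
    return base_pairs * 2 / 8 / 1_000_000


print(complement("ACGT"))               # TGCA
print(len(pack("ACAGAGCAGGTGGCCCTG")))  # 18 bases fit in 5 bytes
print(megabytes(4_000_000))             # 1.0, i.e. 4 Mbp = 1 MB
```

Real sequence file formats also have to cope with unknown bases and quality scores, so they are rarely this compact.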

DNA Sequencing — 1

| Year | Organism | Base Pairs | Genes |
|------|----------|------------|-------|
| 1977 | Bacteriophage ΦX174 | 5 kbases﹡ | 11 |
| 1997 | Escherichia coli | 4.6 Mbp | 4288 |
| 1996 | Brewer's yeast (Saccharomyces cerevisiae) | 12 Mbp | 6275 |
| 2000 | Fruit Fly (Drosophila melanogaster) | 120 Mbp | 15k |
| 2003 | Human (Homo sapiens) | 3 Gbp | 20k |

﹡ It's a single stranded DNA virus which infects bacteria, so it's not really "base pairs"

Bacteria have relatively small genomes, typically a single circular chromosome of somewhere between a few hundred thousand and a few million base pairs. The human genome, by contrast, has about 3 billion base pairs, and each of us has two copies spread over multiple chromosomes.

DNA Sequencing — 2

Image: Abizar at English Wikipedia, CC BY-SA 3.0

We're a lot more complicated than a bacterium, but there's a species of lungfish with 130 billion base pairs and an amoeba with 670 billion base pairs.
So who's counting?

DNA Sequencing — 3

Image: National Human Genome Research Institute

Genes and Evolution

Original      ACAGAGCAGGTGGCCCTG          g.=
Substitution  ACAGTGCAGGTGGCCCTG          g.5A>T
Insertion     ACAGAGCCGATAGGTGGCCCTG      g.7_8insCGAT
Deletion      ACAGAGCTGGCCCTG             g.8_10del
Duplication   ACAGAGCAGGTGGCCAGGTGGCCCTG  g.8_15dup

Unicellular organisms reproduce by copying, and errors arise during the copying process, sometimes leading to novel features in the cells. Substitutions, insertions, deletions, duplications and more unusual errors like inversions cause changes in the DNA.
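Each of these copying errors is just a string edit. Here's a minimal sketch, applied to the example sequence from the slide; the function names are invented for illustration, and positions are 1-based to match the `g.` notation.

```python
ORIGINAL = "ACAGAGCAGGTGGCCCTG"

# Base -> complementary base, used for inversions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def substitute(seq: str, pos: int, new: str) -> str:
    """Replace the single base at 1-based position pos."""
    return seq[:pos - 1] + new + seq[pos:]


def insert(seq: str, pos: int, new: str) -> str:
    """Insert new bases between positions pos and pos + 1."""
    return seq[:pos] + new + seq[pos:]


def delete(seq: str, start: int, end: int) -> str:
    """Remove positions start..end inclusive."""
    return seq[:start - 1] + seq[end:]


def duplicate(seq: str, start: int, end: int) -> str:
    """Repeat positions start..end immediately after themselves."""
    return seq[:end] + seq[start - 1:end] + seq[end:]


def invert(seq: str, start: int, end: int) -> str:
    """Replace positions start..end with their reverse complement."""
    flipped = seq[start - 1:end].translate(COMPLEMENT)[::-1]
    return seq[:start - 1] + flipped + seq[end:]


print(substitute(ORIGINAL, 5, "T"))   # ACAGTGCAGGTGGCCCTG
print(insert(ORIGINAL, 7, "CGAT"))    # ACAGAGCCGATAGGTGGCCCTG
print(delete(ORIGINAL, 8, 10))        # ACAGAGCTGGCCCTG
print(duplicate(ORIGINAL, 8, 15))     # ACAGAGCAGGTGGCCAGGTGGCCCTG
print(invert(ORIGINAL, 8, 10))        # ACAGAGCCCTTGGCCCTG
```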

These changes can cause a gene to be shortened or lengthened or split into pieces or fused with another gene. Bigger errors can also result in multiple copies of a gene appearing, and then these multiple copies can evolve in different directions. Changes can lead to parts of the genome effectively being commented out.

On top of this, pesky things like retroviruses can introduce new genes entirely, effectively writing themselves into a cell's genome to get the cell to make more retroviruses.

It's pretty rare that any of these changes are helpful. Most often, they just result in a broken gene. But over vast timescales, the helpful changes add up. Good features thrive, bad features dwindle, and that's natural selection, leading to evolution.

Horizontal Gene Transfer

Jonasz Patkowski, CC BY-SA 4.0, via Wikimedia Commons

Bacteria also exchange genes with different bacteria or even other organisms, by passing *plasmids*, which are like mini chromosomes. This is called *horizontal gene transfer*.

It may seem counter-intuitive to give away your code to your competitors, but consider: the thing evolution is optimizing here is not the organism, but the gene. If a gene which confers, say, antibiotic resistance gets copied into a new organism and that organism thrives, then the gene is successful even if the old organism is out-competed by the new. The *gene* will get passed on to more organisms.

Yep, Free Software is billions of years old, and there's been a whole lot of cutting and pasting from the primordial script archive.

The Central Dogma — 1

Image: derived from tazvld,Squidonius,toony, CC BY-SA 4.0, via Wikimedia Commons

The "source code" of the genome is DNA, but DNA is just stable storage for the genome; it doesn't actually **do** anything. Instead, DNA is *transcribed* into RNA, and then RNA is *translated* into proteins.
This is often called the "central dogma" of molecular biology.
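As a rough sketch of the first arrow in the diagram, transcription can be modelled as building the RNA strand which pairs with one strand of the DNA. The function name `transcribe` and the example sequence are made up for illustration, and everything else real transcription needs (where to start, where to stop, all the protein machinery) is ignored here.

```python
# DNA base on the template strand -> RNA base that pairs with it.
# RNA uses U (uracil) where DNA uses T.
PAIRING = str.maketrans("ACGT", "UGCA")


def transcribe(template: str) -> str:
    """Return the mRNA (5'->3') made from a template strand written 5'->3'."""
    # The two strands run in opposite directions, so pair each base
    # and then reverse the result.
    return template.translate(PAIRING)[::-1]


coding = "ATGGCAGAG"      # the strand that "reads like" the gene
template = "CTCTGCCAT"    # its reverse complement, which the polymerase reads
mrna = transcribe(template)

print(mrna)                               # AUGGCAGAG
print(mrna == coding.replace("T", "U"))   # True
```

The result is the coding strand with U in place of T, which is why it's usually fine to think of transcription as a straight copy.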

The Central Dogma — 2

Image: derived from tazvld,Squidonius,toony, CC BY-SA 4.0, via Wikimedia Commons

The Central Dogma — 3

Image: derived from tazvld,Squidonius,toony, CC BY-SA 4.0, via Wikimedia Commons

There's also DNA replication to consider. Retroviruses can reverse-transcribe RNA back into DNA, so we'd better include that. There's also *non-coding RNA (ncRNA)*, which functions directly rather than being translated into a protein first.

The Central Dogma — 4

Image: derived from tazvld,Squidonius,toony, CC BY-SA 4.0, via Wikimedia Commons

Transcription from DNA to RNA is done by a protein complex called *RNA polymerase (RNAP)*. This little machine unzips the DNA and assembles an RNA molecule, but RNAP is itself made of proteins.

Proteins are produced from genes, so to build the transcription mechanism we first have to run the transcription mechanism ...

There's also *splicing* and *translation* to consider. DNA is transcribed into RNA, but that RNA isn't in its final form: first it needs to be *spliced*. This is done by, yep, more ncRNA and more proteins. It's a really complicated process.

So RNA is self-modifying code: one piece of RNA can affect the way another piece is expressed, and so the proteins it produces.

Bootstrapping

Images: boot: Auckland Museum, CC BY 4.0, via Wikimedia Commons
boot disk: archive.org (who as well as an image of a boot disk have a boot disk image)

Before we can compile the compiler, we need to compile the compiler.
In computing, we call this *bootstrapping*.
You start off by using very primitive tools, possibly even a pencil, to create a very simple first compiler, and then using that compiler you can build a more sophisticated compiler, and so on.

All of this only works because the parent cell contained enough of these mechanisms to get the whole process started. Kind of a boot disk. Those molecules can then make more of themselves, and the cell continues to run.

You can think of these genes as an "operating system" which the rest of a cell's biology is implemented on top of.

There's no documentation or source control, but by looking at the common features of all cellular life we can hypothesize a "universal common ancestor" which arose about 4 billion years ago, and from which all life is descended.

ATP Synthase

Image: PDB-101

I've been talking about proteins as little machines, and just to make sure you don't think this is some kind of metaphor, this is an animation of a protein complex called ATP synthase.

Translation

| 1st | 2nd: U | 2nd: C | 2nd: A | 2nd: G | 3rd |
|-----|--------|--------|--------|--------|-----|
| U | UUU Phe | UCU Ser | UAU Tyr | UGU Cys | U |
| U | UUC Phe | UCC Ser | UAC Tyr | UGC Cys | C |
| U | UUA Leu | UCA Ser | UAA STOP | UGA STOP | A |
| U | UUG Leu | UCG Ser | UAG STOP | UGG Trp | G |
| C | CUU Leu | CCU Pro | CAU His | CGU Arg | U |
| C | CUC Leu | CCC Pro | CAC His | CGC Arg | C |
| C | CUA Leu | CCA Pro | CAA Gln | CGA Arg | A |
| C | CUG Leu | CCG Pro | CAG Gln | CGG Arg | G |
| A | AUU Ile | ACU Thr | AAU Asn | AGU Ser | U |
| A | AUC Ile | ACC Thr | AAC Asn | AGC Ser | C |
| A | AUA Ile | ACA Thr | AAA Lys | AGA Arg | A |
| A | AUG START / Met | ACG Thr | AAG Lys | AGG Arg | G |
| G | GUU Val | GCU Ala | GAU Asp | GGU Gly | U |
| G | GUC Val | GCC Ala | GAC Asp | GGC Gly | C |
| G | GUA Val | GCA Ala | GAA Glu | GGA Gly | A |
| G | GUG Val | GCG Ala | GAG Glu | GGG Gly | G |

Translation from RNA to Protein isn't simple either. A protein is a long chain of amino acids, which is built up by a complex molecular machine called a *ribosome*, built from ncRNA and proteins.

The mapping from RNA to protein is done by *transfer RNA (tRNA)*. Groups of three bases called *codons* correspond to different tRNAs which each bring an amino acid molecule to add onto the protein.

This table shows the "standard code" which maps codons to amino acids. Not all organisms have the exact same code: this table isn't a law of nature. It's a result of what tRNA happen to be around, and tRNA is produced from the genome, so this translation is happening "in software".

It somewhat resembles the instruction decoding tables used in microprocessors — there's some redundancy where 64 codons translate to 20 amino acids. Translation starts at a "start" codon and finishes when it reaches a "stop" codon.

But there's no looping or branching or I/O or whatever, so how is this like a program?
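To make the "instruction decoding table" idea concrete, the standard code above fits in a Python dict. This is a sketch rather than how any real tool does it: `translate` is an invented name, amino acids are written as one-letter codes, and starting at the first AUG is a simplification of how ribosomes actually pick a start site.

```python
from itertools import product

# Standard genetic code in UCAG order for each codon position,
# one letter per amino acid; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop."""
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


# 64 codons map onto 20 amino acids (plus stop).
print(len(CODON_TABLE), len(set(AMINO_ACIDS) - {"*"}))  # 64 20
print(translate("AUGGCAGAGCAGGUGGCCCUGUAA"))            # MAEQVAL
```

Because the table is plain data, swapping in a different dict gives a different genetic code without touching `translate`.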

Gene Regulation — 1

Image: Thomas Shafee, CC BY 4.0, via Wikimedia Commons

Well, genes are not expressed equally. When genes are packed into a chromosome, that's kind of like a library. There's some header information before and after each gene, known as the "Regulatory Sequences". These affect the way RNA Polymerase attaches to DNA, and how Ribosomes attach to RNA. Genes, or groups of genes, can be promoted or suppressed. Splicing can be suppressed or altered to produce different proteins.

All this is under the control of the molecules within the cell, which themselves exist through the action of other genes or from external stimuli.

Gene Regulation — 2

if lactose and not glucose:
    make_enzymes()
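The two lines above read like the classic *lac* operon in E. coli, which only makes lactose-digesting enzymes when lactose is available and glucose isn't. A runnable toy version might look like this; the function and variable names are invented, and the real mechanism is analogue and leaky rather than a clean boolean.

```python
def lac_enzymes_made(lactose: bool, glucose: bool) -> bool:
    """Toy model of the lac operon as two independent switches."""
    # With no lactose around, a repressor protein sits on the DNA
    # and blocks RNA polymerase.
    repressor_on_dna = not lactose
    # An activator protein only helps RNA polymerase bind when
    # glucose is scarce.
    activator_on_dna = not glucose
    return activator_on_dna and not repressor_on_dna


# Print the full truth table.
for lactose in (False, True):
    for glucose in (False, True):
        made = lac_enzymes_made(lactose, glucose)
        print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> enzymes={made}")
```

Only the lactose-without-glucose row comes out `True`, which is the `if` statement on the slide.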

Gene Regulation — 3

Image: "Gold-standard gene regulatory network #1" (CC-A-4.0)

Regulation is not always by external stimuli: genes can regulate other genes, positively or negatively, in what are called Gene Regulatory Networks.

If each gene is a statement, the cell's "program" is found in the interaction between those statements. And those interactions are extremely complicated.
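One common simplification is to treat each gene as on or off and update them all in lock-step, a so-called boolean network. The three genes `a`, `b` and `c` below and their rules are entirely hypothetical, chosen only to show how behaviour emerges from the interactions rather than from any single gene.

```python
# Hypothetical three-gene network: names and rules are invented.
RULES = {
    "a": lambda s: not s["c"],  # c represses a
    "b": lambda s: s["a"],      # a activates b
    "c": lambda s: s["b"],      # b activates c
}


def step(state: dict) -> dict:
    """Compute every gene's next state from the current state."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}


def run(state: dict, steps: int) -> list:
    """Return the list of states visited, including the initial one."""
    history = [state]
    for _ in range(steps):
        state = step(state)
        history.append(state)
    return history


start = {"a": True, "b": False, "c": False}
for state in run(start, 6):
    # Show each state as a string of 1s and 0s in a, b, c order.
    print("".join("1" if state[g] else "0" for g in "abc"))
```

No single rule says "oscillate", but this loop of one repressor and two activators cycles through six states and returns to where it started.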

Symbiogenesis

Image: Chiswick Chap. Redrawn from File:Symbiogenesis.svg to reflect more recent science., CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons

At some point about 2 billion years ago, a particularly entrepreneurial branch of the Archaea developed a more sophisticated internal cell structure. These are the *Eukaryotes*.

Eukaryote Cell

mitochondria

Image: LadyofHats, Public domain, via Wikimedia Commons

Animals (including humans) are Eukaryotes. Also plants, fungi, algae and slime molds.

The Eukaryotic genome is protected inside a *nucleus*, and specialist *organelles* perform specific functions within the cell. They're linked together by the *cytoskeleton*, a network of filaments and tubes which spans the inside of the cell.

You can think of these as peripheral controllers or coprocessors supporting the main processor, linked by buses. DNA to RNA transcription occurs in the nucleus, and then RNA to protein translation occurs in the *endoplasmic reticulum* surrounding the nucleus.

Yep, we've just added a separation between kernel and userland programs.

Also, some of these organelles, the *mitochondria*, appear to once have been independent Proteobacteria, which were engulfed by the proto-Eukaryotes and instead of being destroyed they were put to work.

Mitochondria are like little containers with their own separate genome. They have their own DNA outside the nuclear DNA. They have their own replication, transcription and translation mechanisms. The cell benefits from their ability to produce ATP, and the mitochondria benefit from the cell's protection.

But they're effectively running their own operating system, like little containers within the host cell. They do their own RNA translation, using their own tRNA, and they happen to use a slightly different translation code to the host cell. They've been engulfed but they're still separate and running their own OS.
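Here's a sketch of what "a slightly different translation code" means in practice. It assumes the vertebrate mitochondrial code, in which four codons are reassigned relative to the standard code; the example message is made up, and amino acids are written as one-letter codes.

```python
from itertools import product

# Standard genetic code in UCAG order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Vertebrate mitochondrial code: same table with four codons reassigned.
MITOCHONDRIAL = {**STANDARD, "UGA": "W", "AUA": "M", "AGA": "*", "AGG": "*"}


def translate(rna: str, table: dict) -> str:
    """Translate codon by codon from the start of rna until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino = table[rna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


message = "AUGAUAUGAAGA"  # made-up message: AUG AUA UGA AGA

print(translate(message, STANDARD))       # MI  (UGA is a stop)
print(translate(message, MITOCHONDRIAL))  # MMW (UGA is Trp, AGA is a stop)
```

Same bytes, different interpreter: a message written for one table can come out truncated or altered when read with the other.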

Plant Cell

mitochondria and chloroplasts

Image: LadyofHats, Public domain, via Wikimedia Commons

We like to think of vertebrates as the pinnacle of evolution but plants have done this twice! In addition to harnessing proteobacteria as mitochondria, plants have harnessed photosynthetic cyanobacteria as *chloroplasts*.

You're probably familiar with the phrase "Embrace, Extend, Extinguish"[^eee] referring to the way proprietary systems can use and then destroy open ones. This may be happening here too: there's evidence that the functions of mitochondria are slowly migrating into the nuclear DNA and being lost from mitochondria.

Multicellular Life

Most (not all) Eukaryotes are multicellular: an organism is made up of many, many cells. Every cell in an organism has the same DNA, the same programming, and what role it ends up performing depends on how it is specialized.

You can think of the DNA as a container image which gets specialized at runtime. The cells in your kidneys contain all the instructions needed to be a brain cell, they just don't use the brain-specific ones, and vice-versa.

The cells in a multicellular organism each have their own separate existence but also communicate by exchanging molecules. The multicellular organism is a network!

Even weirder, some organisms like slime moulds and some bacteria can form multicellular colonies which act somewhat like a multicellular organism but which can also break apart into unicellular life, depending on conditions.

Applications

Everyone complains about the laws of physics,
but no one does anything about them.

— Greg Egan, Schild’s Ladder

Genetic Disorders

| Gene | Protein | Function | Symptoms |
|------|---------|----------|----------|
| HBB | β-globin | Blood oxygen transport | Sickle-cell anemia |
| G6PD | G6PD | Anti-oxidant | Haemolytic anemia |
| F9 | Factor IX | Blood clotting | Haemophilia B |
| BRCA1 | BRCA1 | Tumor suppression | Increased cancer susceptibility |

G6PD — 1

| Sequence | DNA variant | Protein variant | Type |
|----------|-------------|-----------------|------|
| ACAGAGCAGGTGGCCCTG | g.1G>A | p.Ala1Thr | missense |
| CCAGAGCAGGTGGCCCTG | g.1G>C | p.Ala1Pro | missense |
| GAAGAGCAGGTGGCCCTG | g.2C>A | p.Ala1Glu | missense |
| GCAAAGCAGGTGGCCCTG | g.4G>A | p.Glu2Lys | missense |
| GCACAGCAGGTGGCCCTG | g.4G>C | p.Glu2Gln | missense |
| GCAGAACAGGTGGCCCTG | g.6G>A | p.Glu2= | synonymous |
| GCAGACCAGGTGGCCCTG | g.6G>C | p.Glu2Asp | missense |
| GCAGAGAAGGTGGCCCTG | g.7C>A | p.Gln3Lys | missense |
| GCAGAGTAGGTGGCCCTG | g.7C>T | p.Gln3Ter | nonsense |
| GCAGAGCAAGTGGCCCTG | g.9G>A | p.Gln3= | synonymous |
| GCAGAGCACGTGGCCCTG | g.9G>C | p.Gln3His | missense |
| GCAGAGCAGATGGCCCTG | g.10G>A | p.Val4Met | missense |
| GCAGAGCAGCTGGCCCTG | g.10G>C | p.Val4Leu | missense |
| GCAGAGCAGGAGGCCCTG | g.11T>A | p.Val4Glu | missense |
| GCAGAGCAGGCGGCCCTG | g.11T>C | p.Val4Ala | missense |
| GCAGAGCAGGGGGCCCTG | g.11T>G | p.Val4Gly | missense |
| GCAGAGCAGGTAGCCCTG | g.12G>A | p.Val4= | synonymous |
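The rows of this table can be generated mechanically from a reference sequence and the standard genetic code. The sketch below assumes the reference is `GCAGAGCAGGTGGCCCTG` (inferred from the variants shown, not stated on the slide), numbers bases and codons from the start of that fragment as the table appears to, and uses an invented function name `classify`.

```python
from itertools import product

# Standard genetic code for DNA codons, in TCAG order; "*" is stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# One-letter -> three-letter amino acid names, "Ter" for stop.
THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}

REFERENCE = "GCAGAGCAGGTGGCCCTG"  # assumed reference fragment


def classify(ref: str, pos: int, alt: str) -> tuple:
    """Describe a single-base substitution at 1-based position pos."""
    variant = ref[:pos - 1] + alt + ref[pos:]
    codon_index = (pos - 1) // 3          # 0-based codon containing pos
    lo = codon_index * 3
    old = CODONS[ref[lo:lo + 3]]
    new = CODONS[variant[lo:lo + 3]]
    dna = f"g.{pos}{ref[pos - 1]}>{alt}"
    if new == old:
        return dna, f"p.{THREE[old]}{codon_index + 1}=", "synonymous"
    kind = "nonsense" if new == "*" else "missense"
    return dna, f"p.{THREE[old]}{codon_index + 1}{THREE[new]}", kind


print(classify(REFERENCE, 1, "A"))  # ('g.1G>A', 'p.Ala1Thr', 'missense')
print(classify(REFERENCE, 6, "A"))  # ('g.6G>A', 'p.Glu2=', 'synonymous')
print(classify(REFERENCE, 7, "T"))  # ('g.7C>T', 'p.Gln3Ter', 'nonsense')
```

Looping `classify` over every position and every alternative base enumerates all possible single-base variants of the fragment.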

G6PD — 2

Saccharomyces cerevisiae

Mogana Das Murtey and Patchamuthu Ramasamy, CC BY 3.0, via Wikimedia Commons

G6PD — 3

Image: Nick Moore

G6PD — 4

Image: Nick Moore, unpublished preliminary results

G6PD — 5

Image: Functional evidence for G6PD variant classification from mutational scanning (Geck et al)

G6PD — 6

Image: Functional evidence for G6PD variant classification from mutational scanning (Geck et al)

G6PD — 7

Image: Functional evidence for G6PD variant classification from mutational scanning (Geck et al)

Debugging — 1

print("Hello, World!")

Debugging — 2

Image: Elektor Magazine "Debugging without a debugger"

Debugging — 3

Image: Queensland Brain Institute

Debugging — 4

Reference: "Multiplex Assessment of Protein Variant Abundance by Massively Parallel Sequencing"

mRNA & Spam — 1

Ηο𝚝 Ꮮսхս𝚛у ꓪа𝚝сℎ℮ꜱ
ıո γοᴜⲅ ⍺𝖗ҽа

Thanks to Unicode Confusables

mRNA & Spam — 2

| nucleoside | modified nucleoside |
|------------|---------------------|
| uridine (U) | pseudouridine (Ψ), 5-methyluridine (m5U), 2-thiouridine (s2U) |
| adenosine (A) | N6-methyladenosine (m6A) |
| cytidine (C) | 5-methylcytidine (m5C) |

Katalin Karikó

1980s  Research into dsRNA
1985  Move to USA
1990s  Research into mRNA
1997  Collaboration with Drew Weissman
2005  Pseudouridine to prevent immune response
2013  VP BioNTech RNA Pharmaceuticals
2020  BioNTech/Pfizer and Moderna mRNA vaccines for COVID-19
2023  Nobel Prize in Physiology or Medicine (with Drew Weissman)

Thanks!

<blink> Politics </blink>
?

Image: Mogana Das Murtey and Patchamuthu Ramasamy, CC BY 3.0, via Wikimedia Commons