
The Primordial Code

You are running on an operating system a couple of billion years old which is full of primordial libraries, monkey patches, self-modifying code, viral hacks and even containers running a different operating system.

Content Warnings
Mention of infectious and genetic diseases, including cancer, but nothing gruesome.

Background - Me

Nick Moore
nick.zoic.org

Consultant: software development, systems architecture.

mnemote.com

Photo: Charlotte Moore

My name's Nick Moore, I'm a consultant, working in software development and systems architecture. I've been working with computers for a very long time and Python for a fair proportion of that.

Previously

MicroPython
Internet of Things
Visual Programming
Functional Programming in Javascript
NoSQL in Postgres

Bioinformatician?

I've previously presented at PyConAU and related conferences on MicroPython, Internet of Things, Visual Programming, Functional Programming in Javascript, NoSQL in Postgres.

Recently I've been working in bioinformatics. Bioinformatics is the study of biological systems using numerical analysis. Biological systems are pretty complex, so typically this analysis requires computers. I've been lucky enough to get involved in that part.

Thanks & Apologies

Thanks to my colleagues at
WEHI (Walter and Eliza Hall Institute of Medical Research),
University of Washington Genome Sciences
and the Brotman Baty Institute.

Mistakes and oversimplifications are all mine!

Thanks to my colleagues at Walter and Eliza Hall Institute of Medical Research and at University of Washington Genome Sciences and at the Brotman Baty Institute for their patience while I've found my feet in this field.

Mistakes and oversimplifications are all mine!

Slides, Notes, Errata

https://nick.zoic.org/PYCON25/

Bioinformatics — 1

1833	Charles Babbage's Analytical Engine (not completed)
1842	Punched Tapes for telegraphy
1858	Charles Darwin publishes his theory of natural selection
1866	Gregor Mendel identifies "discrete inheritable units"
1869	DNA isolated (an enormously long molecule, purpose unknown)
1870	T. H. Huxley "Biogenesis and Abiogenesis"

Bioinformatics is the numerical study of biological systems. Since biological systems tend to be pretty complicated, this involves computers.

Now you've got two problems.

A lot of the foundations of both fields were laid in the 19th century. Gregor Mendel worked out that inheritance worked using discrete inheritable units. He didn't use the word 'gene' but that's the idea. While Charles Babbage and Ada Lovelace were working on the idea that machines could perform calculations, Charles Darwin and Thomas Huxley were working on natural selection and how life could possibly have gotten started.

In the meantime, the DNA molecule got discovered, an enormously long molecule with no known purpose. We'll come back to that shortly.

Bioinformatics — 2

1913	Morgan & Sturtevant "genetic linkage maps" show linear structure of genetics.
1927	Nikolai Koltsov proposes "giant hereditary molecule" (but runs afoul of his government)
1936	Alan Turing's "universal computing machines" which use an (infinite) tape.
1937	Claude Shannon & foundation of digital computing.
1940	Bletchley Park "Bombe" for cryptanalysis
1940s	Frederick Sanger: Protein Sequencing
1952	Alan Turing works on morphogenesis (but runs afoul of his government)

During the 20th century, researchers worked out that some genes were more closely coupled than others, and this indicated a linear layout of genes in some kind of linked structure.

Nikolai Koltsov proposed a giant molecule as the store of genetic information, but this didn't fit well with the Lysenkoist theories of Soviet Russia, landing him in hot water with his government and leading to his early death.

In the meantime, Alan Turing and Claude Shannon worked out theoretical and practical bases for computing, leading to machine-assisted cryptanalysis and the cracking of the Nazi Enigma codes. After the war, Turing turned his attention to morphogenesis, the way organisms grow complex structures, but he too attracted government attention and met an untimely death.

Proteins were known to be composed of a long string of amino acids, and thought to be something to do with inheritance. Frederick Sanger earns a Nobel Prize for working out how to read these sequences.

Bioinformatics — 3

1944	DNA as the carrier of genetic information
1953	Watson, Crick and Franklin discover the structure of DNA
1970s	Frederick Sanger: DNA sequencing
2005	DNA editing with CRISPR
2015	DNA therapy
2019	in-body gene editing

However, it is soon discovered that proteins are a later step in the process, and the true carrier of genetic information is DNA.

Watson, Crick and Rosalind Franklin work out the structure of DNA and, unperturbed, Frederick Sanger works out how to read DNA sequences, earning another Nobel Prize.

In the 2000s, we work out how to edit DNA, eventually culminating in editing DNA in living beings to cure genetic diseases.

4,000,000,000 years of Biology in 10 minutes

“Many were increasingly of the opinion that they’d all made a big mistake in coming down from the trees in the first place.
And some said that even the trees had been a bad move, and that no one should ever have left the oceans.”

― Douglas Adams, The Hitchhiker's Guide to the Galaxy

Let's just quickly review the last 4 billion years of biology.

Cell

Images: prokaryote: Ali Zifan, CC BY-SA 4.0, via Wikimedia Commons
animal & plant: LadyofHats, Public Domain, via Wikimedia Commons.

The basic unit of life is the cell.

A cell is a bit like a tiny computer.

Inside there's a bunch of programs called genes which tell the cell how to make proteins, which are the working machinery of the cell.

The cell is bounded by a membrane which keeps the insides in and the outsides out, and the cell interacts with the world through channels which are kind of like ports which selectively let specific molecules in and out of the cell.

Bacterial, animal and plant cells are a bit different, but they have these features in common.

DNA

Image: Zephyris, CC BY-SA 3.0, via Wikimedia Commons

Genes, the programs of the cell, are stored on *chromosomes*, very long *DNA* molecules not unlike a tape. DNA is built up out of four "bases", Adenine, Cytosine, Guanine and Thymine, generally just abbreviated as A, C, G and T.

"base pairs"

base	complement
A	T
C	G
G	C
T	A

4 Mbp = 8 Mbit = 1 MB

They zip together in pairs, A complements T and C complements G, so we generally refer to this smallest piece of genetic information as a "base pair". This is the unit we'll use to compare genome sizes. Because there are four possible pairs, each "base pair" is equivalent to 2 binary bits of information. So "1 million base pairs" means 2 megabits of information.
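As a rough sketch of the pairing table and the size arithmetic above (the function names are made up for illustration, and the strand reversal reflects the fact that the two strands of DNA run in opposite directions, which the slide doesn't cover):

```python
# Complement of each DNA base, as in the "base pairs" table.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own direction.

    The two strands are antiparallel, so we complement each base
    and reverse the order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def genome_size_bytes(base_pairs: int) -> float:
    """Each base pair carries 2 bits, so 4 base pairs fit in one byte."""
    return base_pairs * 2 / 8


print(reverse_complement("ACAGAG"))
print(genome_size_bytes(4_000_000))  # the "4 Mbp = 1 MB" line from the slide
```

By this measure a whole bacterial genome fits comfortably on a floppy disk.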

DNA Sequencing — 1

	Organism	Base Pairs	Genes
1977	Bacteriophage ΦX174	5 kbases﹡	11
1996	Brewer's yeast Saccharomyces cerevisiae	12 Mbp	6275
1997	Escherichia coli	4.6 Mbp	4288
2000	Fruit Fly Drosophila melanogaster	120 Mbp	15k
2003	Human Homo sapiens	3 Gbp	20k

﹡ It's a single stranded DNA virus which infects bacteria, so it's not really "base pairs"

Bacteria have relatively small genomes, typically a single circular chromosome of anywhere from hundreds of thousands to a few million base pairs. The human genome by contrast has about 3 billion base pairs, and each of us has two copies spread over multiple chromosomes.

DNA Sequencing — 2

Image: Abizar at English Wikipedia, CC BY-SA 3.0

We're a lot more complicated than a bacterium, but there's a species of lungfish with 130 billion base pairs and an amoeba with 670 billion base pairs.
So who's counting?

DNA Sequencing — 3

the cost is plotted on a log scale!
precision medicine
pangenome (going beyond a reference genome)

Image: National Human Genome Research Institute

Genes and Evolution

Original	ACAGAGCAGGTGGCCCTG	g.=
Substitution	ACAGTGCAGGTGGCCCTG	g.5A>T
Insertion	ACAGAGCCGATAGGTGGCCCTG	g.7_8insCGAT
Deletion	ACAGAGC~~AGG~~TGGCCCTG	g.8_10del
Duplication	ACAGAGCAGGTGGCCAGGTGGCCCTG	g.8_15dup

Radiation, replication errors, retroviruses
Most changes are unhelpful
But over vast time scales ...

Unicellular organisms reproduce by copying, and errors arise during the copying process, sometimes leading to novel features in the cells. Substitutions, insertions, deletions, duplications and more unusual errors like inversions cause changes in the DNA.

These changes can cause a gene to be shortened or lengthened or split into pieces or fused with another gene. Bigger errors can also result in multiple copies of a gene appearing, and then these multiple copies can evolve in different directions. Changes can lead to parts of the genome effectively being commented out.

On top of this, pesky things like retroviruses can introduce new genes entirely, effectively writing themselves into a cell's genome to get the cell to make more retroviruses.

It's pretty rare that any of these changes are helpful. Most often, they just result in a broken gene. But over vast timescales, the helpful changes add up. Good features thrive, bad features dwindle, and that's natural selection, leading to evolution.
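The four mutation types in the table above can be sketched as plain string operations. This is only an illustration: the helper names are invented here, positions are 1-based as in the `g.` notation, and the deletion is taken to remove positions 8 to 10 (the `AGG` in the table).

```python
ORIGINAL = "ACAGAGCAGGTGGCCCTG"


def substitute(seq: str, pos: int, alt: str) -> str:
    """Replace the single base at 1-based position pos with alt."""
    return seq[:pos - 1] + alt + seq[pos:]


def insert(seq: str, pos: int, bases: str) -> str:
    """Insert bases between 1-based positions pos and pos + 1."""
    return seq[:pos] + bases + seq[pos:]


def delete(seq: str, start: int, end: int) -> str:
    """Remove 1-based positions start..end inclusive."""
    return seq[:start - 1] + seq[end:]


def duplicate(seq: str, start: int, end: int) -> str:
    """Repeat positions start..end immediately after themselves."""
    return seq[:end] + seq[start - 1:end] + seq[end:]


print(substitute(ORIGINAL, 5, "T"))   # g.5A>T
print(insert(ORIGINAL, 7, "CGAT"))    # g.7_8insCGAT
print(delete(ORIGINAL, 8, 10))        # removes AGG
print(duplicate(ORIGINAL, 8, 15))     # g.8_15dup
```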

Horizontal Gene Transfer

Image: Jonasz Patkowski, CC BY-SA 4.0, via Wikimedia Commons

Bacteria also exchange genes with different bacteria or even other organisms, by passing *plasmids*, which are like mini chromosomes. This is called *horizontal gene transfer*.

It may seem counter-intuitive to give away your code to your competitors, but consider: the thing evolution is optimizing here is not the organism, but the gene. If a gene which confers, say, antibiotic resistance gets copied into a new organism and that organism thrives, then the gene is successful even if the old organism is out-competed by the new. The *gene* will get passed on to more organisms.

Yep, Free Software is billions of years old, and there's been a whole lot of cutting and pasting from the primordial script archive.

The Central Dogma — 1

Image: derived from tazvld,Squidonius,toony, CC BY-SA 4.0, via Wikimedia Commons

The "source code" of the genome is DNA, but DNA is just stable storage for the genome: it doesn't actually **do** anything. Instead, DNA is *transcribed* into RNA, and then RNA is *translated* into proteins.
This is often called the "central dogma" of cellular biology.
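At the level of letters, transcription is almost a straight copy. A minimal sketch, assuming we start from the coding strand (the one that reads the same as the RNA) and relying on the fact that RNA uses uracil (U) where DNA uses thymine (T); the function name is made up for illustration:

```python
def transcribe(coding_strand: str) -> str:
    """DNA coding strand -> RNA: the same sequence with T replaced by U."""
    return coding_strand.replace("T", "U")


print(transcribe("GCAGAGCAGGTGGCCCTG"))
```

The real machinery reads the opposite (template) strand and builds its complement, which comes out as the same string.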

The Central Dogma — 2

Image: derived from tazvld,Squidonius,toony, CC BY-SA 4.0, via Wikimedia Commons

The Central Dogma — 3

Image: derived from tazvld,Squidonius,toony, CC BY-SA 4.0, via Wikimedia Commons

There's also cell replication to consider. Retroviruses can reverse-transcribe RNA back into DNA, so we'd better include that. There's also *non-coding RNA* which functions directly rather than being translated into a protein first.

The Central Dogma — 4

Image: derived from tazvld,Squidonius,toony, CC BY-SA 4.0, via Wikimedia Commons

Transcription from DNA to RNA is done by a protein complex called *RNA polymerase (RNAP)*. This little machine unzips the DNA and assembles an RNA molecule, but RNAP is itself made of proteins.
Proteins are produced from genes, so to build the transcription mechanism we first have to run the transcription mechanism ...

There's also *splicing* and *translation* to consider. DNA is transcribed into RNA, but that RNA isn't in its final form: first it needs to be *spliced*. This is done by, yep, more ncRNA and more proteins. It's a really complicated process.
So RNA is self-modifying code: one piece of RNA can affect the way another piece is expressed, and so which proteins it produces.

Bootstrapping

Images: boot: Auckland Museum, CC BY 4.0, via Wikimedia Commons
boot disk: archive.org (who as well as an image of a boot disk have a boot disk image)

Before we can compile the compiler, we need to compile the compiler.
In computing, we call this *bootstrapping*.
You start off by using very primitive tools, possibly even a pencil, to create a very simple first compiler, and then using that compiler you can build a more sophisticated compiler, and so on.
All of this only works because the parent cell contained enough of these mechanisms to get the whole process started. Kind of a boot disk. Those molecules can then make more of themselves, and the cell continues to run.
You can think of these genes as an "operating system" which the rest of a cell's biology is implemented on top of.
There's no documentation or source control, but by looking at the common features of all cellular life we can hypothesize a "universal common ancestor" which arose about 4 billion years ago, and from which all life is descended.

ATP Synthase

Image: PDB-101

I've been talking about proteins as little machines, and just to make sure you don't think this is some kind of metaphor, this is an animation of a protein complex called ATP synthase.

Translation

1st	2nd								3rd
1st	U		C		A		G		3rd
U	UUU	Phe	UCU	Ser	UAU	Tyr	UGU	Cys	U
	UUC	Phe	UCC		UAC	Tyr	UGC	Cys	C
	UUA	Leu	UCA		UAA	STOP	UGA	STOP	A
	UUG	Leu	UCG		UAG	STOP	UGG	Trp	G
C	CUU	Leu	CCU	Pro	CAU	His	CGU	Arg	U
	CUC		CCC		CAC	His	CGC		C
	CUA		CCA		CAA	Gln	CGA		A
	CUG		CCG		CAG	Gln	CGG		G
A	AUU	Ile	ACU	Thr	AAU	Asn	AGU	Ser	U
	AUC		ACC		AAC	Asn	AGC	Ser	C
	AUA		ACA		AAA	Lys	AGA	Arg	A
	AUG	START / Met	ACG		AAG	Lys	AGG	Arg	G
G	GUU	Val	GCU	Ala	GAU	Asp	GGU	Gly	U
	GUC		GCC		GAC	Asp	GGC		C
	GUA		GCA		GAA	Glu	GGA		A
	GUG		GCG		GAG	Glu	GGG		G

Translation from RNA to Protein isn't simple either. A protein is a long chain of amino acids, which is built up by a complex molecular machine called a *ribosome*, built from ncRNA and proteins.
The mapping from RNA to protein is done by *transfer RNA (tRNA)*. Groups of three bases called *codons* correspond to different tRNAs which each bring an amino acid molecule to add onto the protein.
This table shows the "standard code" which maps codons to amino acids. Not all organisms have the exact same code: this table isn't a law of nature. It's a result of which tRNAs happen to be around, and tRNA is produced from the genome, so this translation is happening "in software".
It somewhat resembles the instruction decoding tables used in microprocessors — there's some redundancy where 64 codons translate to 20 amino acids. Translation starts at a "start" codon and finishes when it reaches a "stop" codon.
But there's no looping or branching or I/O or whatever, so how is this like a program?
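The codon table above works much like a lookup table in an instruction decoder. Here's a minimal sketch, assuming one-letter amino acid codes with `*` for STOP, and assuming translation simply begins at the first AUG it finds (real start-site selection is more involved); `translate` is a name invented for this example:

```python
from itertools import product

# The standard code in the same order as the table: first base, then
# second base, then third base, each cycling through U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the RNA."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":  # stop codon: release the protein
            break
        protein.append(amino)
    return "".join(protein)


print(translate("CCAUGGCAGAGUAAGG"))  # Met, Ala, Glu, then STOP
```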

Gene Regulation — 1

Image: Thomas Shafee, CC BY 4.0, via Wikimedia Commons

Well, genes are not expressed equally. When genes are packed into a chromosome, that's kind of like a library. There's some header information before and after each gene, known as the "Regulatory Sequences". These affect the way RNA Polymerase attaches to DNA, and how Ribosomes attach to RNA. Genes, or groups of genes, can be promoted or suppressed. Splicing can be suppressed or altered to produce different proteins.
All this is under the control of the molecules within the cell, which themselves exist through the action of other genes or from external stimuli.

Gene Regulation — 2

In E. coli, a group of genes (lac) make enzymes to break lactose down into galactose and glucose.
The presence of lactose promotes these genes
The presence of glucose represses these genes
Therefore, enzymes are only produced when useful and needed

if lactose and not glucose:
    make_enzymes()
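Treating the two sugars as simple on/off inputs, the rule above becomes a four-row truth table. This is a deliberately crude sketch (real regulation is graded rather than strictly boolean, and `lac_enzymes_made` is a made-up name):

```python
def lac_enzymes_made(lactose: bool, glucose: bool) -> bool:
    """Lactose promotes the lac genes, glucose represses them."""
    return lactose and not glucose


# Print the full truth table.
for lactose in (False, True):
    for glucose in (False, True):
        result = lac_enzymes_made(lactose, glucose)
        print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> enzymes={result}")
```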

Gene Regulation — 3

Image: "Gold-standard gene regulatory network #1"(CC-A-4.0)

Regulation is not always by external stimuli: genes can regulate other genes, positively or negatively, in what are called Gene Regulatory Networks.
If each gene is a statement, the cell's "program" is found in the interaction between those statements. And those interactions are extremely complicated.
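To get a feel for how behaviour emerges from the interactions rather than from any single gene, here's a toy Boolean network. Everything in it is hypothetical: three invented genes where `a` switches on `b`, `b` switches on `c`, and `c` switches off `a`, all updated in lockstep.

```python
# Each rule computes a gene's next state from the current state of the network.
RULES = {
    "a": lambda state: not state["c"],  # c represses a
    "b": lambda state: state["a"],      # a promotes b
    "c": lambda state: state["b"],      # b promotes c
}


def step(state: dict) -> dict:
    """Apply every rule at once to get the next state of the network."""
    return {gene: rule(state) for gene, rule in RULES.items()}


def run(state: dict, steps: int) -> list:
    """Return the list of states visited, starting with the initial one."""
    history = [state]
    for _ in range(steps):
        state = step(state)
        history.append(state)
    return history


start = {"a": True, "b": False, "c": False}
for state in run(start, 6):
    print("".join(gene if on else "-" for gene, on in state.items()))
```

No single rule says "oscillate", but this network cycles through six states and returns to where it started.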

Symbiogenesis

Image: Chiswick Chap. Redrawn from File:Symbiogenesis.svg to reflect more recent science., CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons

At some point about 3 billion years ago, a particularly entrepreneurial branch of the Archaea developed a more sophisticated internal cell structure. These are the *Eukaryotes*.

Eukaryote Cell

mitochondria

Image: LadyofHats, Public domain, via Wikimedia Commons

Animals (including humans) are Eukaryotes. Also plants, fungi, algae and slime molds.

The Eukaryotic genome is protected inside a *nucleus*, and specialist *organelles* perform specific functions within the cell. They're linked together by the *cytoskeleton*, a network of filaments and tubes which spans the inside of the cell.

You can think of these as peripheral controllers or coprocessors supporting the main processor, linked by buses. DNA to RNA transcription occurs in the nucleus, and then RNA to protein translation occurs in the *endoplasmic reticulum* surrounding the nucleus.

Yep, we've just added a separation between kernel and userland programs.

Also, some of these organelles, the *mitochondria*, appear to once have been independent Proteobacteria, which were engulfed by the proto-Eukaryotes and instead of being destroyed they were put to work.

Mitochondria are like little containers with their own separate genome. They have their own DNA outside the nuclear DNA. They have their own replication, transcription and translation mechanisms. The cell benefits from their ability to produce ATP, and the mitochondria benefit from the cell's protection.

But they're effectively running their own operating system, like little containers within the host cell. They do their own RNA translation, using their own tRNA, and they happen to use a slightly different translation code to the host cell. They've been engulfed but they're still separate and running their own OS.
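To make "slightly different" concrete, here's a sketch comparing the standard code with the vertebrate mitochondrial code. The four overrides are not from the slides; they are the differences listed for the vertebrate mitochondrial code in NCBI's genetic code tables. The `translate` helper is invented for this example and just reads codons from the start of the string.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Vertebrate mitochondria: same table with four codons reassigned.
VERTEBRATE_MITO = {**STANDARD, "UGA": "W", "AUA": "M", "AGA": "*", "AGG": "*"}


def translate(rna: str, table: dict) -> str:
    """Read codons from the start of rna until a stop codon ('*')."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino = table[rna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


message = "AUGUGAAUAAGA"
print(translate(message, STANDARD))         # stops at UGA
print(translate(message, VERTEBRATE_MITO))  # reads UGA as Trp, stops at AGA
```

The same message produces different proteins depending on which "OS" decodes it.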

Plant Cell

mitochondria and chloroplasts

Image: LadyofHats, Public domain, via Wikimedia Commons

We like to think of vertebrates as the pinnacle of evolution but plants have done this twice! In addition to harnessing proteobacteria as mitochondria, plants have harnessed photosynthetic cyanobacteria as *chloroplasts*.

You're probably familiar with the phrase "Embrace, Extend, Extinguish" referring to the way proprietary systems can use and then destroy open ones. This may be happening here too: there's evidence that the functions of mitochondria are slowly migrating into the nuclear DNA and being lost from mitochondria.

Multicellular Life

Many (not all) Eukaryotes are multicellular
Specialization at runtime
A network of cells, communicating by molecules
Slime Moulds and some bacteria: *sometimes* multicellular

Many (not all) Eukaryotes are multicellular: an organism is made up of many, many cells. Every cell in an organism has the same DNA, the same programming, and what role it ends up performing depends on how it is specialized.
You can think of the DNA as a container image which gets specialized at runtime. The cells in your kidneys contain all the instructions needed to be a brain cell, they just don't use the brain-specific ones, and vice-versa.
The cells in a multicellular organism each have their own separate existence but also communicate by exchanging molecules. The multicellular organism is a network!
Even weirder, some organisms like slime moulds and some bacteria can form multicellular colonies which act somewhat like a multicellular organism but which can also break apart into unicellular life, depending on conditions.

Applications

Everyone complains about the laws of physics,
but no one does anything about them.

— Greg Egan, Schild’s Ladder

Genetic Disorders

Gene	Protein	Function	Symptoms
HBB	β-globin	Blood oxygen transport	Sickle-cell anemia
G6PD	G6PD	Anti-oxidant	Haemolytic anemia
F9	Factor IX	Blood clotting	Haemophilia B
BRCA1	BRCA1	Tumor suppression	Increased cancer susceptibility

G6PD — 1

Sequence	DNA variant	Protein variant	Type
ACAGAGCAGGTGGCCCTG	g.1G>A	p.Ala1Thr	missense
CCAGAGCAGGTGGCCCTG	g.1G>C	p.Ala1Pro	missense
GAAGAGCAGGTGGCCCTG	g.2C>A	p.Ala1Glu	missense
GCAAAGCAGGTGGCCCTG	g.4G>A	p.Glu2Lys	missense
GCACAGCAGGTGGCCCTG	g.4G>C	p.Glu2Gln	missense
GCAGAACAGGTGGCCCTG	g.6G>A	p.Glu2=	synonymous
GCAGACCAGGTGGCCCTG	g.6G>C	p.Glu2Asp	missense
GCAGAGAAGGTGGCCCTG	g.7C>A	p.Gln3Lys	missense
GCAGAGTAGGTGGCCCTG	g.7C>T	p.Gln3Ter	nonsense
GCAGAGCAAGTGGCCCTG	g.9G>A	p.Gln3=	synonymous
GCAGAGCACGTGGCCCTG	g.9G>C	p.Gln3His	missense
GCAGAGCAGATGGCCCTG	g.10G>A	p.Val4Met	missense
GCAGAGCAGCTGGCCCTG	g.10G>C	p.Val4Leu	missense
GCAGAGCAGGAGGCCCTG	g.11T>A	p.Val4Glu	missense
GCAGAGCAGGCGGCCCTG	g.11T>C	p.Val4Ala	missense
GCAGAGCAGGGGGCCCTG	g.11T>G	p.Val4Gly	missense
GCAGAGCAGGTAGCCCTG	g.12G>A	p.Val4=	synonymous
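The rows above can be reproduced mechanically from the reference sequence and the codon table. A minimal sketch, assuming the reference is `GCAGAGCAGGTGGCCCTG` (the sequence every row differs from by one base) and numbering codons from the start of this fragment as the table does; `classify` is a name invented for this example:

```python
from itertools import product

# Standard codon table, written with DNA letters (T instead of U).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Ter",
}

REFERENCE = "GCAGAGCAGGTGGCCCTG"


def classify(ref: str, pos: int, alt: str) -> tuple:
    """Describe a single-base substitution at 1-based position pos."""
    variant = ref[:pos - 1] + alt + ref[pos:]
    codon_index = (pos - 1) // 3          # 0-based codon containing pos
    start = codon_index * 3
    old = CODONS[ref[start:start + 3]]
    new = CODONS[variant[start:start + 3]]
    dna = f"g.{pos}{ref[pos - 1]}>{alt}"
    residue = f"p.{THREE_LETTER[old]}{codon_index + 1}"
    if new == old:
        return dna, residue + "=", "synonymous"
    kind = "nonsense" if new == "*" else "missense"
    return dna, residue + THREE_LETTER[new], kind


for pos, alt in [(6, "A"), (7, "T"), (11, "C")]:
    print(*classify(REFERENCE, pos, alt), sep="\t")
```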

G6PD — 2

Saccharomyces cerevisiae

Mogana Das Murtey and Patchamuthu Ramasamy, CC BY 3.0, via Wikimedia Commons

G6PD — 3

Image: Nick Moore

G6PD — 4

Image: Nick Moore, unpublished preliminary results

G6PD — 5

Image: Functional evidence for G6PD variant classification from mutational scanning (Geck et al)

G6PD — 6

Image: Functional evidence for G6PD variant classification from mutational scanning (Geck et al)

G6PD — 7

Image: Functional evidence for G6PD variant classification from mutational scanning (Geck et al)

Debugging — 1

print("Hello, World!")

Debugging — 2

Image: Elektor Magazine "Debugging without a debugger"

Debugging — 3

Image: Queensland Brain Institute

Debugging — 4

Reference: "Multiplex Assessment of Protein Variant Abundance by Massively Parallel Sequencing"

mRNA & Spam — 1

Ηο𝚝 Ꮮսхս𝚛у ꓪа𝚝сℎ℮ꜱ
ıո γοᴜⲅ ⍺𝖗ҽа

Thanks to Unicode Confusables

mRNA & Spam — 2

nucleoside	modified nucleoside
uridine (U)	pseudouridine (Ψ)
	5-methyluridine (m5U)
	2-thiouridine (s2U)
adenosine (A)	N6-methyladenosine (m6A)
cytidine (C)	5-methylcytidine (m5C)

Katalin Karikó

1980s	Research into dsRNA
1985	Move to USA
1990s	Research into mRNA
1997	Collaboration with Drew Weissman
2005	Pseudouridine to prevent immune response
2013	VP BioNTech RNA Pharmaceuticals
2020	BioNTech/Pfizer and Moderna mRNA vaccines for COVID-19
2023	Nobel Prize in Physiology or Medicine (with Drew Weissman)

Thanks!

<blink> Politics </blink>

Open Collaboration
Immunotherapy for cancer
Huge potential for developing world
Imagination!

Image: Mogana Das Murtey and Patchamuthu Ramasamy, CC BY 3.0, via Wikimedia Commons