The Primordial Code
You are running on an operating system a couple of billion years old
which is full of primordial libraries, monkey patches, self-modifying
code, viral hacks and even containers running a different operating system.
Content Warnings |
---|
Mention of infectious and genetic diseases,
including cancer, but nothing gruesome. |
Background - Me
Nick Moore
nick.zoic.org
Consultant: software development, systems architecture.
mnemote.com
Photo: Charlotte Moore
My name's Nick Moore, I'm a consultant, working in software development and systems architecture.
I've been working with computers for a very long time and Python for a fair proportion of that.
Previously
- MicroPython
- Internet of Things
- Visual Programming
- Functional Programming in Javascript
- NoSQL in Postgres
Bioinformatician?
I've previously presented at PyConAU and related conferences on
MicroPython, Internet of Things, Visual Programming, Functional Programming in Javascript, NoSQL in Postgres.
Recently I've been working in bioinformatics.
Bioinformatics is the study of biological systems using numerical analysis.
Biological systems are pretty complex, so typically this analysis requires computers.
I've been lucky enough to get involved in that part.
Thanks & Apologies
Thanks to my colleagues at
WEHI (Walter and Eliza Hall Institute of Medical Research),
University of Washington Genome Sciences
and the Brotman Baty Institute.
Mistakes and oversimplifications are all mine!
Thanks to my colleagues at Walter and Eliza Hall Institute of Medical Research and at University of Washington Genome Sciences and at the Brotman Baty Institute for their patience while I've found my feet in this field.
Mistakes and oversimplifications are all mine!
Slides, Notes, Errata
https://nick.zoic.org/PYCON25/
Bioinformatics — 1
1833 | Charles Babbage's Analytical Engine (not completed) |
1842 | Punched Tapes for telegraphy |
1858 | Charles Darwin publishes his theory of natural selection |
1866 | Gregor Mendel identifies "discrete inheritable units" |
1869 | DNA isolated (an enormously long molecule, purpose unknown) |
1870 | T. H. Huxley "Biogenesis and Abiogenesis" |
Bioinformatics is the numerical study of biological systems.
Since biological systems tend to be pretty complicated, this involves computers.
Now you've got two problems.
A lot of the foundations of both fields were laid in the 19th century.
Gregor Mendel worked out that inheritance worked using discrete inheritable
units; he didn't use the word 'gene' but that's the idea.
While Charles Babbage and Ada Lovelace were working on the idea that
machines could perform calculations, Charles Darwin and Thomas Huxley were
working on natural selection and how life could possibly have
gotten started.
In the meantime, the DNA molecule got discovered, an enormously long molecule
with no known purpose. We'll come back to that shortly.
Bioinformatics — 2
1913 | Morgan & Sturtevant "genetic linkage maps" show linear structure of genetics. |
1927 | Nikolai Koltsov proposes "giant hereditary molecule" (but runs afoul of his government) |
1936 | Alan Turing's "universal computing machines" which use an (infinite) tape. |
1937 | Claude Shannon & foundation of digital computing. |
1940 | Bletchley Park "Bombe" for cryptanalysis |
1940s | Frederick Sanger: Protein Sequencing |
1952 | Alan Turing works on morphogenesis (but runs afoul of his government) |
During the 20th century, researchers worked out that some genes were
more closely coupled than others, and this indicated a linear layout of
genes in some kind of linked structure.
Nikolai Koltsov proposed a giant molecule as the store of genetic information,
but this didn't fit well with the Lysenkoist theories of Soviet Russia,
landing him in hot water with his government and leading to his early death.
In the meantime, Alan Turing and Claude Shannon worked out theoretical
and practical bases for computing, leading to machine-assisted cryptanalysis
and the cracking of the Nazi Enigma codes. After the war, Turing would turn
his attention to morphogenesis, the way organisms grow complex structures,
but also ended up getting government attention and an untimely death.
Proteins were known to be composed of a long string of amino acids, and
thought to be something to do with inheritance.
Frederick Sanger earns a Nobel Prize for working out how to read these
sequences.
Bioinformatics — 3
1944 | DNA as the carrier of genetic information |
1953 | Watson, Crick and Franklin discover the structure of DNA |
1970s | Frederick Sanger: DNA sequencing |
2005 | DNA editing with CRISPR |
2015 | DNA therapy |
2019 | in-body gene editing |
However, it is soon discovered that proteins are a later step in the process,
and the true carrier of genetic information is DNA.
Watson, Crick and Rosalind Franklin work out the structure of DNA and,
unperturbed, Frederick Sanger works out how to read DNA sequences, earning
another Nobel Prize.
In the 2000s, we work out how to edit DNA, eventually culminating in
editing DNA in living beings to cure genetic diseases.
4,000,000,000 years of Biology in 10 minutes
“Many were increasingly of the opinion that they’d all made a big mistake
in coming down from the trees in the first place.
And some said that even
the trees had been a bad move, and that no one should ever have left the oceans.”
― Douglas Adams, The Hitchhiker's Guide to the Galaxy
Let's just quickly review the last 4 billion years of biology.
The basic unit of life is the cell.
A cell is a bit like a tiny computer.
Inside there's a bunch of programs called genes which tell the cell
how to make proteins, which are the working machinery of the cell.
The cell is bounded by a membrane which keeps the insides in and the
outsides out, and the cell interacts with the world through
channels which are kind of like ports which selectively
let specific molecules in and out of the cell.
Bacterial, animal and plant cells are a bit different, but they have
these features in common.
Genes, the programs of the cell, are stored on *chromosomes*, very long *DNA* molecules not unlike a tape.
DNA is built up out of four "bases", Adenine, Cytosine, Guanine and Thymine, generally just abbreviated as A C G and T.
"base pairs"
base | complement |
A | T |
C | G |
G | C |
T | A |
4 Mbp = 8 Mbit = 1 MB
They zip together in pairs, A complements T and C complements G, so we generally refer to this smallest piece of genetic information as a "base pair".
This is the unit we'll use to compare genome sizes.
Because there are four possible pairs, each "base pair" is equivalent to 2 binary bits of information.
So "1 million base pairs" means 2 megabits of information.
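The arithmetic can be sketched in Python. This is only an illustration: the particular bit assignment in `BASE_TO_BITS` and the helper names `pack`, `complement` and `size_in_bytes` are assumptions made up for this example, not a standard file format.

```python
# Two bits per base. The bit values chosen here are arbitrary (an assumption).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pack(dna: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(dna), 4):
        chunk = dna[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so every byte has four 2-bit slots.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def complement(dna: str) -> str:
    """Return the base that pairs with each base, position by position."""
    return "".join(COMPLEMENT[base] for base in dna)


def size_in_bytes(base_pairs: int) -> int:
    """Storage needed at 2 bits per base pair, rounded up to whole bytes."""
    return (base_pairs * 2 + 7) // 8


print(complement("ACGT"))
print(pack("ACGT").hex())
print(size_in_bytes(4_000_000))
```

With this encoding, four base pairs fit in one byte, which is where "4 Mbp = 8 Mbit = 1 MB" comes from.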
DNA Sequencing — 1
Year | Organism | Base Pairs | Genes |
1977 | Bacteriophage ΦX174 | 5 kbases﹡ | 11 |
1997 | Escherichia coli | 4.6 Mbp | 4288 |
1996 | Brewer's yeast Saccharomyces cerevisiae | 12 Mbp | 6275 |
2000 | Fruit Fly Drosophila melanogaster | 120 Mbp | 15k |
2003 | Human Homo sapiens | 3 Gbp | 20k |
﹡ It's a single stranded DNA virus which infects bacteria, so it's not really "base pairs"
Bacteria have relatively small genomes, typically a single circular chromosome of hundreds of thousands through to a few million base pairs.
The human genome by contrast has about 3 billion base pairs, and each of us has two copies spread over multiple chromosomes.
We're a lot more complicated than a bacterium, but there's a species of lungfish with 130 billion base pairs and an amoeba with 670 billion base pairs.
So who's counting?
Genes and Evolution
Original | ACAGAGCAGGTGGCCCTG | g.= |
Substitution | ACAGTGCAGGTGGCCCTG | g.5A>T |
Insertion | ACAGAGCCGATAGGTGGCCCTG | g.7_8insCGAT |
Deletion | ACAGAGCTGGCCCTG | g.8_10del |
Duplication | ACAGAGCAGGTGGCCAGGTGGCCCTG | g.8_15dup |
- Radiation, replication errors, retroviruses
- Most changes are unhelpful
- But over vast time scales ...
Unicellular organisms reproduce by copying, and errors arise during the copying process, sometimes leading to novel features in the cells.
Substitutions, insertions, deletions, duplications and more unusual errors like inversions cause
changes in the DNA.
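As a rough sketch, each of these edits is just a string operation. The function names below are made up for illustration, and positions are 1-based to match the `g.` notation on the slide; real variant-calling tools handle many more cases.

```python
def substitute(seq: str, pos: int, alt: str) -> str:
    """g.<pos><ref>><alt>: replace the single base at 1-based position pos."""
    return seq[:pos - 1] + alt + seq[pos:]


def insert(seq: str, pos: int, new: str) -> str:
    """g.<pos>_<pos+1>ins<new>: insert new bases after 1-based position pos."""
    return seq[:pos] + new + seq[pos:]


def delete(seq: str, start: int, end: int) -> str:
    """g.<start>_<end>del: remove bases start..end, inclusive."""
    return seq[:start - 1] + seq[end:]


def duplicate(seq: str, start: int, end: int) -> str:
    """g.<start>_<end>dup: repeat bases start..end directly after themselves."""
    return seq[:end] + seq[start - 1:end] + seq[end:]


original = "ACAGAGCAGGTGGCCCTG"
print(substitute(original, 5, "T"))   # g.5A>T
print(insert(original, 7, "CGAT"))    # g.7_8insCGAT
print(delete(original, 8, 10))        # deletes the AGG at positions 8 to 10
print(duplicate(original, 8, 15))     # g.8_15dup
```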
These changes can cause a gene to be shortened or lengthened or split into pieces or fused with another gene.
Bigger errors can also result in multiple copies of a gene appearing, and then these multiple copies can evolve in different directions.
Changes can lead to parts of the genome effectively being commented out.
On top of this, pesky things like retroviruses can introduce new genes entirely, effectively writing themselves into a cell's genome to get the cell to make more retroviruses.
It's pretty rare that any of these changes are helpful.
Most often, they just result in a broken gene.
But over vast timescales, the helpful changes add up.
Good features thrive, bad features dwindle, and that's natural selection, leading to evolution.
Bacteria also exchange genes with different bacteria or even other organisms, by passing *plasmids*, which are like mini chromosomes.
This is called *horizontal gene transfer*.
It may seem counter-intuitive to give away your code to your competitors, but consider: the thing evolution is optimizing here is not the organism, but the gene.
If a gene which confers, say, antibiotic resistance gets copied into a new organism and that organism thrives, then the gene is successful even if the old organism is out-competed by the new.
The *gene* will get passed on to more organisms.
Yep, Free Software is billions of years old, and there's been a whole lot of cutting and pasting from the primordial script archive.
The "source code" of the genome is DNA, but DNA is just stable storage for the genome: it doesn't actually **do** anything.
Instead, DNA is *transcribed* into RNA, and then RNA is *translated* into proteins.
This is often called the "central dogma" of molecular biology.
There's also cell replication to consider.
Retroviruses can reverse-transcribe RNA back into DNA, so we'd better include that.
There's also *non-coding RNA* which functions directly rather than being translated into a protein first.
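At the level of letters, transcription is a very small transformation. The sketch below is a simplification and the function names are invented for this example: it assumes we are given the coding strand of the DNA, so the RNA has the same sequence with uracil (U) in place of thymine (T). It ignores the molecular machinery entirely.

```python
def transcribe(dna_coding_strand: str) -> str:
    """DNA -> RNA: same letters as the coding strand, with U in place of T."""
    return dna_coding_strand.replace("T", "U")


def reverse_transcribe(rna: str) -> str:
    """RNA -> DNA, the direction retroviruses use: U becomes T again."""
    return rna.replace("U", "T")


dna = "ATGGCAGAGCAGGTG"
rna = transcribe(dna)
print(rna)
print(reverse_transcribe(rna) == dna)
```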
Transcription from DNA to RNA is done by a protein complex called *RNA polymerase (RNAP)*. This little machine unzips the DNA and assembles an RNA molecule, but RNAP is itself made of proteins.
Proteins are produced from genes, so to build the transcription mechanism we first have to run the transcription mechanism ...
There's also *splicing* and *translation* to consider.
DNA is transcribed into RNA, but that RNA isn't in its final form: first it needs to be *spliced*.
This is done by, yep, more ncRNA and more proteins.
It's a really complicated process.
So RNA is self-modifying code: one piece of RNA can affect the way another piece is expressed, and so the proteins it produces.
Before we can compile the compiler, we need to compile the compiler.
In computing, we call this *bootstrapping*.
You start off by using very primitive tools, possibly even a pencil, to create a very simple first compiler, and then using that compiler you can build a more sophisticated compiler, and so on.
All of this only works because the parent cell contained enough of these mechanisms to get the whole process started.
Kind of a boot disk.
Those molecules can then make more of themselves, and the cell continues to run.
You can think of these genes as an "operating system" which the rest of a cell's biology is implemented on top of.
There's no documentation or source control, but by looking at the common features of all cellular life we can hypothesize a "universal common ancestor" which arose about 4 billion years ago, and from which all life is descended.
I've been talking about proteins as little machines, and just to make sure you don't
think this is some kind of metaphor, this is an animation of a protein complex called
ATP synthase.
Translation
1st | 2nd = U | 2nd = C | 2nd = A | 2nd = G | 3rd |
U | UUU Phe | UCU Ser | UAU Tyr | UGU Cys | U |
U | UUC Phe | UCC Ser | UAC Tyr | UGC Cys | C |
U | UUA Leu | UCA Ser | UAA STOP | UGA STOP | A |
U | UUG Leu | UCG Ser | UAG STOP | UGG Trp | G |
C | CUU Leu | CCU Pro | CAU His | CGU Arg | U |
C | CUC Leu | CCC Pro | CAC His | CGC Arg | C |
C | CUA Leu | CCA Pro | CAA Gln | CGA Arg | A |
C | CUG Leu | CCG Pro | CAG Gln | CGG Arg | G |
A | AUU Ile | ACU Thr | AAU Asn | AGU Ser | U |
A | AUC Ile | ACC Thr | AAC Asn | AGC Ser | C |
A | AUA Ile | ACA Thr | AAA Lys | AGA Arg | A |
A | AUG START / Met | ACG Thr | AAG Lys | AGG Arg | G |
G | GUU Val | GCU Ala | GAU Asp | GGU Gly | U |
G | GUC Val | GCC Ala | GAC Asp | GGC Gly | C |
G | GUA Val | GCA Ala | GAA Glu | GGA Gly | A |
G | GUG Val | GCG Ala | GAG Glu | GGG Gly | G |
Translation from RNA to Protein isn't simple either.
A protein is a long chain of amino acids, which is built up by a complex molecular machine called a *ribosome*, built from ncRNA and proteins.
The mapping from RNA to protein is done by *transfer RNA (tRNA)*.
Groups of three bases called *codons* correspond to different tRNAs which each bring an amino acid molecule to add onto the protein.
This table shows the "standard code" which maps codons to amino acids.
Not all organisms have the exact same code: this table isn't a law of nature.
It's a result of which tRNAs happen to be around, and tRNAs are produced from the genome, so this translation is happening "in software".
It somewhat resembles the instruction decoding tables used in microprocessors — there's some redundancy where 64 codons translate to 20 amino acids.
Translation starts at a "start" codon and finishes when it reaches a "stop" codon.
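Here is a minimal sketch of that lookup in Python. The standard code is packed into a 64-character string of one-letter amino acid codes (`*` for STOP), ordered U, C, A, G in each codon position. The `translate` function is a made-up illustration: it simply assumes translation begins at the first `AUG` it finds, which is a big simplification of how a ribosome really chooses its start site.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order; "*" means STOP.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Translate from the first AUG up to (not including) the first STOP codon."""
    start = rna.find("AUG")
    if start < 0:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GGAUGGCAGAGCAGGUGUAAGG"))
```

The example reads `AUG GCA GAG CAG GUG UAA`, so it prints `MAEQV` (Met, Ala, Glu, Gln, Val) and stops at `UAA`.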
But there's no looping or branching or I/O or whatever, so how is this like a program?
Well, genes are not expressed equally.
When genes are packed into a chromosome, that's kind of like a library.
There's some header information before and after each gene, known as the "Regulatory Sequences".
These affect the way RNA Polymerase attaches to DNA, and how Ribosomes attach to RNA.
Genes, or groups of genes, can be promoted or suppressed.
Splicing can be suppressed or altered to produce different proteins.
All this is under the control of the molecules within the cell, which themselves exist through the action of other genes or from external stimuli.
Gene Regulation — 2
- In E. coli, a group of genes (lac) make enzymes to break lactose
down into galactose and glucose.
- The presence of lactose promotes these genes
- The presence of glucose represses these genes
- Therefore, enzymes are only produced when useful and needed
if lactose and not glucose:
    make_enzymes()
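The slide's pseudocode can be made runnable as a tiny truth table. This is a deliberately crude on/off model with an invented function name; in a real cell, expression levels vary continuously rather than switching between two states.

```python
def lac_enzymes_expressed(lactose: bool, glucose: bool) -> bool:
    """Simplified on/off model: lactose promotes, glucose represses."""
    return lactose and not glucose


# Print the full truth table for the two inputs.
for lactose in (False, True):
    for glucose in (False, True):
        state = "ON" if lac_enzymes_expressed(lactose, glucose) else "off"
        print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> lac genes {state}")
```

Only one of the four combinations, lactose present and glucose absent, switches the genes on.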
Regulation is not always by external stimuli: genes can regulate other genes,
positively or negatively, in what are called Gene Regulatory Networks.
If each gene is a statement, the cell's "program" is found in the interaction between those statements.
And those interactions are extremely complicated.
At some point about 3 billion years ago, a particularly entrepreneurial branch of the Archaea developed a more sophisticated internal cell structure.
These are the *Eukaryotes*.
Eukaryote Cell
mitochondria
Image:
LadyofHats, Public domain, via Wikimedia Commons
Animals (including humans) are Eukaryotes.
Also plants, fungi, algae and slime molds.
The Eukaryotic genome is protected inside a *nucleus*, and specialist *organelles* perform specific functions within the cell.
They're linked together by the *cytoskeleton*, a network of filaments and tubes which spans the inside of the cell.
You can think of these as peripheral controllers or coprocessors supporting the main processor, linked by buses.
DNA to RNA transcription occurs in the nucleus, and then RNA to protein translation occurs in the *endoplasmic reticulum* surrounding the nucleus.
Yep, we've just added a separation between kernel and userland programs.
Also, some of these organelles, the *mitochondria*, appear to once have been independent Proteobacteria, which were engulfed by the proto-Eukaryotes and instead of being destroyed they were put to work.
Mitochondria are like little containers with their own separate genome.
They have their own DNA outside the nuclear DNA.
They have their own replication, transcription and translation mechanisms.
The cell benefits from their ability to produce ATP, and the mitochondria benefit from the cell's protection.
But they're effectively running their own operating system, like little containers within the host cell.
They do their own RNA translation, using their own tRNA, and they happen to use a slightly different translation code to the host cell.
They've been engulfed but they're still separate and running their own OS.
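To make "slightly different" concrete, here is a sketch that assumes the vertebrate mitochondrial code specifically (other lineages differ in other ways). It builds the standard table, applies the four codon reassignments used in vertebrate mitochondria, and lists where the two tables disagree. The variable names are invented for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; "*" means STOP.
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
standard_code = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), STANDARD)
}

# Assumption: vertebrate mitochondria. Four codons are reassigned.
mito_code = dict(standard_code)
mito_code.update({"UGA": "W", "AUA": "M", "AGA": "*", "AGG": "*"})

differences = {
    codon: (standard_code[codon], mito_code[codon])
    for codon in standard_code
    if standard_code[codon] != mito_code[codon]
}
for codon, (nuclear, mitochondrial) in sorted(differences.items()):
    print(f"{codon}: host cell reads {nuclear}, mitochondrion reads {mitochondrial}")
```

So the same RNA would produce a different protein depending on which "OS" decodes it: `UGA` is a STOP to the host cell but tryptophan to the mitochondrion.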
Plant Cell
mitochondria and chloroplasts
Image: LadyofHats, Public domain, via Wikimedia Commons
We like to think of vertebrates as the pinnacle of evolution but plants have done this twice!
In addition to harnessing proteobacteria as mitochondria, plants have harnessed photosynthetic cyanobacteria as *chloroplasts*.
You're probably familiar with the phrase "Embrace, Extend, Extinguish"[^eee]
referring to the way proprietary systems can use and then destroy open ones.
This may be happening here too: there's evidence that the functions of mitochondria are slowly migrating into the nuclear DNA and being lost from mitochondria.
Multicellular Life
- Many (not all) Eukaryotes are multicellular
- Specialization at runtime
- A network of cells, communicating by molecules
- Slime Moulds and some bacteria: *sometimes* multicellular
Many (not all) Eukaryotes are multicellular: an organism is made up of many,
many cells.
Every cell in an organism has the same DNA, the same programming, and what role it ends up performing depends on how it is specialized.
You can think of the DNA as a container image which gets specialized at runtime.
The cells in your kidneys contain all the instructions needed to be a brain cell, they just don't use the brain-specific ones, and vice-versa.
The cells in a multicellular organism each have their own separate existence but also communicate by exchanging molecules.
The multicellular organism is a network!
Even weirder, some organisms like slime moulds and some bacteria can form multicellular colonies which act somewhat like a multicellular organism but which can also break apart into unicellular life, depending on conditions.
Applications
Everyone complains about the laws of physics,
but no one does anything about them.
— Greg Egan, Schild’s Ladder
Genetic Disorders
Gene | Protein | Function | Symptoms |
HBB | β-globin | Blood oxygen transport | Sickle-cell anemia |
G6PD | G6PD | Anti-oxidant | Haemolytic anemia |
F9 | Factor IX | Blood clotting | Haemophilia B |
BRCA1 | BRCA1 | Tumor suppression | Increased cancer susceptibility |
G6PD — 1
Sequence | DNA variant | Protein variant | Type |
ACAGAGCAGGTGGCCCTG | g.1G>A | p.Ala1Thr | missense |
CCAGAGCAGGTGGCCCTG | g.1G>C | p.Ala1Pro | missense |
GAAGAGCAGGTGGCCCTG | g.2C>A | p.Ala1Glu | missense |
GCAAAGCAGGTGGCCCTG | g.4G>A | p.Glu2Lys | missense |
GCACAGCAGGTGGCCCTG | g.4G>C | p.Glu2Gln | missense |
GCAGAACAGGTGGCCCTG | g.6G>A | p.Glu2= | synonymous |
GCAGACCAGGTGGCCCTG | g.6G>C | p.Glu2Asp | missense |
GCAGAGAAGGTGGCCCTG | g.7C>A | p.Gln3Lys | missense |
GCAGAGTAGGTGGCCCTG | g.7C>T | p.Gln3Ter | nonsense |
GCAGAGCAAGTGGCCCTG | g.9G>A | p.Gln3= | synonymous |
GCAGAGCACGTGGCCCTG | g.9G>C | p.Gln3His | missense |
GCAGAGCAGATGGCCCTG | g.10G>A | p.Val4Met | missense |
GCAGAGCAGCTGGCCCTG | g.10G>C | p.Val4Leu | missense |
GCAGAGCAGGAGGCCCTG | g.11T>A | p.Val4Glu | missense |
GCAGAGCAGGCGGCCCTG | g.11T>C | p.Val4Ala | missense |
GCAGAGCAGGGGGCCCTG | g.11T>G | p.Val4Gly | missense |
GCAGAGCAGGTAGCCCTG | g.12G>A | p.Val4= | synonymous |
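The "Type" column can be derived mechanically from the codon table. The sketch below is an illustration rather than a real annotation tool: `classify` is a made-up helper, it assumes the reference fragment is `GCAGAGCAGGTGGCCCTG` with the reading frame starting at position 1, and it numbers amino acids within the fragment as the table above does.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (DNA letters); "*" means STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Ter",
}


def classify(ref: str, pos: int, alt: str):
    """Describe a single-base substitution at 1-based position pos."""
    codon_index = (pos - 1) // 3          # which codon the base falls in
    start = codon_index * 3
    offset = pos - 1 - start              # position of the base within the codon
    ref_codon = ref[start:start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODONS[ref_codon], CODONS[alt_codon]
    dna = f"g.{pos}{ref[pos - 1]}>{alt}"
    residue = f"p.{THREE_LETTER[ref_aa]}{codon_index + 1}"
    if alt_aa == ref_aa:
        return dna, residue + "=", "synonymous"
    kind = "nonsense" if alt_aa == "*" else "missense"
    return dna, residue + THREE_LETTER[alt_aa], kind


reference = "GCAGAGCAGGTGGCCCTG"
for pos, alt in [(1, "A"), (6, "A"), (7, "T"), (11, "G")]:
    print(classify(reference, pos, alt))
```

A synonymous change leaves the protein untouched, a missense change swaps one amino acid, and a nonsense change introduces a premature STOP.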
G6PD — 3
Image: Nick Moore
G6PD — 4
Image: Nick Moore, unpublished preliminary results
Debugging — 1
print("Hello, World!")
mRNA & Spam — 2
nucleoside | modified nucleoside |
uridine (U) | pseudouridine (Ψ) |
5-methyluridine (m5U) |
2-thiouridine (s2U) |
adenosine (A) | N6-methyladenosine (m6A) |
cytidine (C) | 5-methylcytidine (m5C) |
Katalin Karikó
1980s | Research into dsRNA |
1985 | Move to USA |
1990s | Research into mRNA |
1997 | Collaboration with Drew Weissman |
2005 | Pseudouridine to prevent immune response |
2013 | VP BioNTech RNA Pharmaceuticals |
2020 | BioNTech/Pfizer and Moderna mRNA vaccines for COVID-19 |
2023 | Nobel Prize in Physiology or Medicine (with Drew Weissman) |
Thanks!
<blink>
Politics
</blink>
- Open Collaboration
- Immunotherapy for cancer
- Huge potential for developing world
- Imagination!